* [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
@ 2017-01-27 22:52 Jérôme Glisse
From: Jérôme Glisse @ 2017-01-27 22:52 UTC
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans, Jérôme Glisse

Cliff notes: HMM offers 2 things (each standing on its own). First, it
allows device memory to be used transparently inside any process, without
any modification to the process's program code. Second, it allows a
process address space to be mirrored on a device.

Changes since v16:
  - move HMM unaddressable device memory to its own radix tree, so
    find_dev_pagemap() will no longer return an HMM dev_pagemap
  - rename the HMM migration helpers (drop the prefix) and make them
    completely independent of HMM

    Migration can now be used to implement things like multi-threaded
    copy, or to make use of a specific memory allocator for the
    destination memory.

Work is under way to use this feature inside nouveau (the upstream
open source driver for NVidia GPUs) in either the 4.11 or 4.12
timeframe. This patchset has otherwise been tested with the closed
source driver for NVidia GPUs, so we are confident that it works and
that it allows the hardware to be used for seamless interaction between
CPU and GPU in the common address space of a process.

I also discussed the features with other companies and I am confident
they can be used on other, as yet unreleased, hardware.

Please consider applying for 4.11.


Known issues:

Device memory picks some random unused physical address range. Later
memory hotplug might fail because of this. The intention is to fix this
in a later patchset by using physical addresses above the platform limit,
thus making sure that no real memory can be hotplugged at a conflicting
address.


Patchset overview:

The patchset is divided into 3 features that can each be used independently
of one another. First are the changes to ZONE_DEVICE so we can have struct
page for un-addressable device memory (patches 1-4 and 13-14). Second is
process address space mirroring (patches 8-11), which allows snapshotting
the CPU page table and keeping the device page table synchronized with the
CPU one.

Last is a new page migration helper which allows migration of a range of
virtual addresses using a hardware copy engine (patches 5-7 for the new
migrate function and patch 12 for migration of un-addressable memory).


Future plan:

In this patchset I restricted myself to a set of core features. What
is missing:
  - force read only on CPU for memory duplication and GPU atomics
  - changes to mmu_notifier for optimization purposes
  - migration of file-backed pages to device memory

I plan to submit a couple more patchsets to implement those features
once core HMM is upstream.

Git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v17


Previous patchset postings:
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/
    v14 https://lkml.org/lkml/2016/12/8/344
    v15 http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1304107.html
    v16 http://www.spinics.net/lists/linux-mm/msg119814.html

Jérôme Glisse (14):
  mm/memory/hotplug: convert device bool to int to allow for more flags
    v2
  mm/ZONE_DEVICE/free-page: callback when page is freed v2
  mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
    memory v3
  mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  mm/migrate: add new boolean copy flag to migratepage() callback
  mm/migrate: new memory migration helper for use with device memory v3
  mm/migrate: migrate_vma() unmap page from vma while collecting pages
  mm/hmm: heterogeneous memory management (HMM for short)
  mm/hmm/mirror: mirror process address space on device with HMM helpers
  mm/hmm/mirror: helper to snapshot CPU page table
  mm/hmm/mirror: device page fault handler
  mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2

 MAINTAINERS                                |    7 +
 arch/ia64/mm/init.c                        |   23 +-
 arch/powerpc/mm/mem.c                      |   22 +-
 arch/s390/mm/init.c                        |   10 +-
 arch/sh/mm/init.c                          |   22 +-
 arch/tile/mm/init.c                        |   10 +-
 arch/x86/mm/init_32.c                      |   23 +-
 arch/x86/mm/init_64.c                      |   41 +-
 drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
 fs/aio.c                                   |    7 +-
 fs/btrfs/disk-io.c                         |   11 +-
 fs/hugetlbfs/inode.c                       |    9 +-
 fs/nfs/internal.h                          |    5 +-
 fs/nfs/write.c                             |    9 +-
 fs/proc/task_mmu.c                         |   10 +-
 fs/ubifs/file.c                            |    8 +-
 include/linux/balloon_compaction.h         |    3 +-
 include/linux/fs.h                         |   13 +-
 include/linux/hmm.h                        |  464 +++++++++++
 include/linux/ioport.h                     |    1 +
 include/linux/memory_hotplug.h             |   31 +-
 include/linux/memremap.h                   |   39 +-
 include/linux/migrate.h                    |   83 +-
 include/linux/mm_types.h                   |    5 +
 include/linux/swap.h                       |   18 +-
 include/linux/swapops.h                    |   67 ++
 kernel/fork.c                              |    2 +
 kernel/memremap.c                          |   31 +-
 mm/Kconfig                                 |   38 +
 mm/Makefile                                |    1 +
 mm/balloon_compaction.c                    |    2 +-
 mm/hmm.c                                   | 1235 ++++++++++++++++++++++++++++
 mm/memory.c                                |   64 +-
 mm/memory_hotplug.c                        |   14 +-
 mm/migrate.c                               |  659 ++++++++++++++-
 mm/mprotect.c                              |   12 +
 mm/rmap.c                                  |   47 ++
 mm/zsmalloc.c                              |   12 +-
 38 files changed, 2986 insertions(+), 80 deletions(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.4.3


* [HMM v17 01/14] mm/memory/hotplug: convert device bool to int to allow for more flags v2
@ 2017-01-27 22:52 Jérôme Glisse
From: Jérôme Glisse @ 2017-01-27 22:52 UTC
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Russell King, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

When hotplugging memory we want more information on the type of memory and
its properties. Replace the device boolean flag with an int and define a set
of flags.

The new property for device memory is an opt-in flag to allow page migration
from and to a ZONE_DEVICE. Existing users of ZONE_DEVICE do not expect
page migration to work for their pages. New changes to page migration are
changing that, so we now need a flag to explicitly opt in to page migration.
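
As a rough illustration of the flag handling described above, here is a
minimal userspace sketch. The three flag values are copied from this patch;
the helper names check_hotplug_flags() and flags_for_device() are
hypothetical and exist only for this example. The real arch_add_memory()
implementations also call BUG() on an unsupported flag, which the sketch
leaves out.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values copied from the patch (include/linux/memory_hotplug.h). */
#define MEMORY_NORMAL 0
#define MEMORY_DEVICE (1 << 0)
#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)

/*
 * Hypothetical stand-in for the check each arch_add_memory() performs:
 * reject any flag bit that the architecture does not know how to handle.
 * The patch additionally calls BUG(); this sketch only returns the error.
 */
static int check_hotplug_flags(int flags, int supported_flags)
{
	if (flags & ~supported_flags)
		return -EINVAL;
	return 0;
}

/* The old "bool for_device" argument is recovered by masking one bit. */
static bool flags_for_device(int flags)
{
	return (flags & MEMORY_DEVICE) != 0;
}
```

An architecture that supports no device memory at all would pass 0 as
supported_flags, so that only MEMORY_NORMAL is accepted. That matches the
plain "if (flags)" test used for s390 and tile in the diff below.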

Changes since v1:
  - Improved commit message
  - Improved define name
  - Improved comments
  - Typos

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/ia64/mm/init.c            | 23 ++++++++++++++++++++---
 arch/powerpc/mm/mem.c          | 22 +++++++++++++++++++---
 arch/s390/mm/init.c            | 10 ++++++++--
 arch/sh/mm/init.c              | 22 +++++++++++++++++++---
 arch/tile/mm/init.c            | 10 ++++++++--
 arch/x86/mm/init_32.c          | 23 ++++++++++++++++++++---
 arch/x86/mm/init_64.c          | 23 ++++++++++++++++++++---
 include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
 include/linux/memremap.h       | 11 +++++++++++
 kernel/memremap.c              |  4 ++--
 mm/memory_hotplug.c            |  4 ++--
 11 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1841ef6..303027e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,18 +645,27 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 
 	if (ret)
@@ -667,13 +676,21 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..6e877d3 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,14 +126,22 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int rc;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
@@ -147,19 +155,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 
 	/* this should work for most non-highmem platforms */
 	zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, 0, for_device);
+		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index b3e9d18..e94b9e1 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
 	unsigned long start_pfn = PFN_DOWN(start);
@@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	struct zone *zone;
 	int rc, i;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
@@ -201,7 +207,7 @@ unsigned long memory_block_size_bytes(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..0ca69ac 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,19 +485,27 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
 	ret = __add_pages(nid, pgdat->node_zones +
 			zone_for_memory(nid, start, size, ZONE_NORMAL,
-			for_device),
+					flags & MEMORY_DEVICE),
 			start_pfn, nr_pages);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
@@ -516,13 +524,21 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (unlikely(ret))
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index adce254..ba001b1 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,13 +863,19 @@ void __init mem_init(void)
  * memory to the highmem for now.
  */
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
+int arch_add_memory(u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata = &contig_page_data;
 	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(zone, start_pfn, nr_pages);
 }
 
@@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/* TODO */
 	return -EBUSY;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cf80590..8287a4b 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,24 +816,41 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata = NODE_DATA(nid);
 	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
+		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	return __remove_pages(zone, start_pfn, nr_pages);
 }
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 14b9dd7..442ac86 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -651,15 +651,24 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -956,8 +965,10 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -965,6 +976,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 01033fa..3f50eb8 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,7 +103,7 @@ extern bool memhp_auto_online;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, int flags);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -275,7 +275,27 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+
+/*
+ * When hotplugging memory with arch_add_memory() we want more information on
+ * the type of memory and its properties. The flags parameter allows providing
+ * more information on the memory which is being added.
+ *
+ * Provide an opt-in flag for struct page migration. Persistent device memory
+ * never relied on struct page migration so far and new user of might also
+ * prefer avoiding struct page migration.
+ *
+ * New non device memory specific flags can be added if ever needed.
+ *
+ * MEMORY_REGULAR: regular system memory
+ * DEVICE_MEMORY: device memory create a ZONE_DEVICE zone for it
+ * DEVICE_MEMORY_ALLOW_MIGRATE: page in that device memory ca be migrated
+ */
+#define MEMORY_NORMAL 0
+#define MEMORY_DEVICE (1 << 0)
+#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+
+extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..f7e0609 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,6 +53,12 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -71,6 +77,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..07665eb 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
 	if (error)
 		goto err_add_memory;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e43142c1..096c651 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1372,7 +1372,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
 
 	if (ret < 0)
 		goto error;
@@ -2156,7 +2156,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_NORMAL);
 
 	try_offline_node(nid);
 
-- 
2.4.3


* [HMM v17 02/14] mm/ZONE_DEVICE/free-page: callback when page is freed v2
@ 2017-01-27 22:52 Jérôme Glisse
From: Jérôme Glisse @ 2017-01-27 22:52 UTC
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Dan Williams, Ross Zwisler

When a ZONE_DEVICE page refcount reaches 1 it means the page is free and
nobody is holding a reference on it (only the device to which the memory
belongs does). Add a callback and call it when that happens so device
drivers can implement their own free page management.
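
The callback mechanism can be modelled outside the kernel as follows. This
is only a sketch built on mock types: struct mock_page, struct mock_pagemap,
mock_put_zone_device_page() and count_free_pages() are hypothetical names,
and only the page_free and data fields mirror the patch. The sketch also
folds the reference drop into the put function, which is an assumption made
to keep the example short.

```c
#include <stddef.h>

/* Mock types: only the fields this sketch needs, not the kernel layout. */
struct mock_page;
typedef void (*dev_page_free_t)(struct mock_page *page, void *data);

struct mock_pagemap {
	dev_page_free_t page_free;	/* optional driver callback */
	void *data;			/* private pointer handed to page_free */
};

struct mock_page {
	int refcount;			/* 1 means only the device holds it */
	struct mock_pagemap *pgmap;
};

/*
 * Drop one reference, then notify the driver if only the device's own
 * reference is left. This mirrors the refcount == 1 test in the patch.
 */
static void mock_put_zone_device_page(struct mock_page *page)
{
	page->refcount--;
	if (page->pgmap->page_free && page->refcount == 1)
		page->pgmap->page_free(page, page->pgmap->data);
}

/* Example driver callback: count how many pages came back. */
static void count_free_pages(struct mock_page *page, void *data)
{
	(void)page;
	(*(int *)data)++;
}
```

A driver would typically use the data pointer to reach its own allocator
state, so that a freed page can be put back on a device-side free list.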

Changes since v1:
  - Do not update devm_memremap_pages() to take extra argument

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 8 +++++++-
 kernel/memremap.c        | 8 ++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f7e0609..06fa74b 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,23 +35,29 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_free: free page callback when page refcount reach 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: privata data pointer for page_free
  */
 struct dev_pagemap {
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+			  struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 
 static inline bool dev_page_allow_migrate(const struct page *page)
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07665eb..2f37c92 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+	/*
+	 * If refcount is 1 then page is freed and refcount is stable as nobody
+	 * holds a reference on the page.
+	 */
+	if (page->pgmap->page_free && page_count(page) == 1)
+		page->pgmap->page_free(page, page->pgmap->data);
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
@@ -322,6 +328,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->page_free = NULL;
+	pgmap->data = NULL;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
-- 
2.4.3


* [HMM v17 03/14] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v3
@ 2017-01-27 22:52 Jérôme Glisse
From: Jérôme Glisse @ 2017-01-27 22:52 UTC
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Dan Williams, Ross Zwisler

This adds support for un-addressable device memory. Such memory is hotplugged
only so that we can have struct pages for it, but we should never map it, as
such memory cannot be accessed by the CPU. For that reason it uses a special
swap entry for the CPU page table entry.

This patch implements all the logic, from the special swap type to handling
CPU page faults through a callback specified in the ZONE_DEVICE pgmap struct.
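
To make the idea of a special swap entry concrete, here is a toy model. It
is not the kernel's encoding: the bit layout, the type number
TOY_SWP_DEVICE and every toy_* name are assumptions chosen for
illustration. The patch itself provides is_device_entry() and
device_entry_to_page() for the same purpose, with an architecture-specific
layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy layout: the top 5 bits hold the entry type, the rest hold a pfn. */
#define TOY_TYPE_SHIFT 59
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)

/* Hypothetical type number reserved for un-addressable device pages. */
#define TOY_SWP_DEVICE 30

typedef struct { uint64_t val; } toy_swp_entry_t;

/* Build a non-present entry that records which device page it stands for. */
static toy_swp_entry_t toy_make_device_entry(uint64_t pfn)
{
	toy_swp_entry_t e = {
		((uint64_t)TOY_SWP_DEVICE << TOY_TYPE_SHIFT) |
		(pfn & TOY_OFFSET_MASK)
	};
	return e;
}

/* A fault handler would test this before calling the driver's callback. */
static bool toy_is_device_entry(toy_swp_entry_t e)
{
	return (e.val >> TOY_TYPE_SHIFT) == TOY_SWP_DEVICE;
}

/* Recover the page frame number stored in the entry. */
static uint64_t toy_device_entry_to_pfn(toy_swp_entry_t e)
{
	return e.val & TOY_OFFSET_MASK;
}
```

Because the entry is never a present mapping, any CPU access faults, and the
fault path can then look up the page and hand control to the device driver.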

Architectures that wish to support un-addressable device memory should make
sure to never populate the kernel linear mapping for the physical range.

This feature potentially breaks memory hotplug unless every driver using it
magically predicts the future addresses of where memory will be hotplugged.

Changes since v2:
  - Do not change devm_memremap_pages()
Changes since v1:
  - Add unaddressable memory resource descriptor enum
  - Explain why memory hotplug can fail because of un-addressable memory

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c             | 10 ++++++-
 include/linux/ioport.h         |  1 +
 include/linux/memory_hotplug.h |  7 +++++
 include/linux/memremap.h       | 20 +++++++++++++
 include/linux/swap.h           | 18 ++++++++++--
 include/linux/swapops.h        | 67 ++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c              | 23 +++++++++++++--
 mm/Kconfig                     | 12 ++++++++
 mm/memory.c                    | 64 +++++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c            | 10 +++++--
 mm/mprotect.c                  | 12 ++++++++
 11 files changed, 235 insertions(+), 9 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 958f325..9a6ab71 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -535,8 +535,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
+		} else if (is_migration_entry(swpent)) {
 			page = migration_entry_to_page(swpent);
+		} else if (is_device_entry(swpent)) {
+			page = device_entry_to_page(swpent);
+		}
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -699,6 +702,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1182,6 +1187,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_entry(entry))
+			page = device_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..d154a18 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
 	IORES_DESC_ACPI_NV_STORAGE		= 3,
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
+	IORES_DESC_UNADDRESSABLE_MEMORY		= 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3f50eb8..e7c5dc6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -285,15 +285,22 @@ extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
  * never relied on struct page migration so far and new user of might also
  * prefer avoiding struct page migration.
  *
+ * For device memory (which uses ZONE_DEVICE) we want to differentiate between
+ * CPU accessible memory (persistent memory, device memory on an architecture
+ * with a system bus that allows transparent access to device memory) and
+ * unaddressable memory (device memory the CPU cannot access directly).
+ *
  * New non device memory specific flags can be added if ever needed.
  *
  * MEMORY_REGULAR: regular system memory
  * DEVICE_MEMORY: device memory create a ZONE_DEVICE zone for it
  * DEVICE_MEMORY_ALLOW_MIGRATE: page in that device memory ca be migrated
+ * MEMORY_DEVICE_UNADDRESSABLE: un-addressable memory (CPU cannot access it)
  */
 #define MEMORY_NORMAL 0
 #define MEMORY_DEVICE (1 << 0)
 #define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)
 
 extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 06fa74b..041d5b9 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,24 +35,33 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				struct page *page,
+				unsigned flags,
+				pmd_t *pmdp);
 typedef void (*dev_page_free_t)(struct page *page, void *data);
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when the CPU faults on an un-addressable device page
  * @page_free: free page callback when page refcount reach 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
  * @data: privata data pointer for page_free
+ * @flags: device memory flags (see MEMORY_DEVICE_* in memory_hotplug.h)
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
 	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
 	void *data;
+	int flags;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
@@ -65,6 +74,12 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 	return ((page_zonenum(page) == ZONE_DEVICE) &&
 		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return ((page_zonenum(page) != ZONE_DEVICE) ||
+		!(page->pgmap->flags & MEMORY_DEVICE_UNADDRESSABLE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -88,6 +103,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return true;
+}
 #endif
 
 /**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09b212d..81b44ea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Un-addressable device memory support
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -410,8 +422,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..0e339f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 2f37c92..a7334fa 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -200,6 +202,21 @@ void put_zone_device_page(struct page *page)
 }
 EXPORT_SYMBOL(put_zone_device_page);
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_entry_to_page(entry);
+
+	BUG_ON(!page->pgmap->page_fault);
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -252,7 +269,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
+	arch_remove_memory(align_start, align_size, pgmap->flags);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -328,6 +345,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->flags = MEMORY_DEVICE;
+	pgmap->page_fault = NULL;
 	pgmap->page_free = NULL;
 	pgmap->data = NULL;
 
@@ -366,7 +385,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
 	if (error)
 		goto err_add_memory;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index e9b7c7e..0c33f46 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,18 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_UNADDRESSABLE
+	bool "Un-addressable device memory (GPU memory, ...)"
+	depends on ZONE_DEVICE
+
+	help
+	  Allow creation of struct page for un-addressable device memory,
+	  i.e. memory that is only accessible by the device (or group of
+	  devices).
+
+	  Having struct page is necessary for process memory migration
+	  to device memory.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index e870322..69bede9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -890,6 +891,25 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_entry(entry)) {
+			page = device_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for un-addressable pages as
+			 * they should be considered just like any other page.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			if (is_write_device_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*src_pte))
+					pte = pte_swp_mksoft_dirty(pte);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1179,6 +1199,32 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
+
+		/*
+		 * Un-addressable pages must always be checked as they are not
+		 * like other swap entries and thus should be checked no matter
+		 * what the details->check_swap_entries value is.
+		 */
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_entry(entry)) {
+			struct page *page = device_entry_to_page(entry);
+
+			if (unlikely(details)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping &&
+				    details->check_mapping != page_rmapping(page))
+					continue;
+			}
+
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+		}
+
 		/* only check swap_entries if explicitly asked for in details */
 		if (unlikely(details && !details->check_swap_entries))
 			continue;
@@ -2550,6 +2596,14 @@ int do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_entry_fault(vma, vmf->address, entry,
+						 vmf->flags, vmf->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
@@ -3518,6 +3572,7 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 static int handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
+	struct page *page;
 
 	if (unlikely(pmd_none(*vmf->pmd))) {
 		/*
@@ -3568,9 +3623,16 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
+	/* Catch mapping of un-addressable memory; this should never happen */
+	entry = vmf->orig_pte;
+	page = pfn_to_page(pte_pfn(entry));
+	if (!is_addressable_page(page)) {
+		print_bad_pte(vmf->vma, vmf->address, entry, page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
 	if (vmf->flags & FAULT_FLAG_WRITE) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 096c651..76f5359 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -149,7 +149,7 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res;
+	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
@@ -158,7 +158,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	if (request_resource(&iomem_resource, res) < 0) {
+	conflict = request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (conflict->desc == IORES_DESC_UNADDRESSABLE_MEMORY) {
+			pr_debug("Device un-addressable memory block "
+				 "memory hotplug at %#010llx !\n",
+				 (unsigned long long)start);
+		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cc2459c..fc3dd08 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -140,6 +140,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_entry(entry)) {
+				pte_t newpte;
+
+				make_device_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
2.4.3
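
For readers following along, the way the patch above carves two extra
swap types out of the swap-type space can be modelled in user space. This
is only a sketch, not kernel code: every toy_* name, the 64-bit packing
and TOY_TYPE_SHIFT are assumptions made for illustration, and the
constants assume MAX_SWAPFILES_SHIFT is 5 with hwpoison, migration and
device entries all enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration: all special swap types enabled, 5 type bits. */
#define MAX_SWAPFILES_SHIFT 5
#define SWP_HWPOISON_NUM    1
#define SWP_MIGRATION_NUM   2
#define SWP_DEVICE_NUM      2

/* Same arithmetic as the patch: device types shrink MAX_SWAPFILES by 2. */
#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
	 SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
#define SWP_DEVICE_WRITE \
	(MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
#define SWP_DEVICE \
	(MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)

/* Hypothetical packing: type in the top 5 bits, offset (pfn) below it. */
#define TOY_TYPE_SHIFT  59
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)

typedef struct { uint64_t val; } toy_swp_entry_t;

static inline toy_swp_entry_t toy_swp_entry(unsigned type, uint64_t offset)
{
	toy_swp_entry_t e;

	e.val = ((uint64_t)type << TOY_TYPE_SHIFT) | (offset & TOY_OFFSET_MASK);
	return e;
}

static inline unsigned toy_swp_type(toy_swp_entry_t e)
{
	return (unsigned)(e.val >> TOY_TYPE_SHIFT);
}

static inline uint64_t toy_swp_offset(toy_swp_entry_t e)
{
	return e.val & TOY_OFFSET_MASK;
}

/* Encode a device pfn; the type records whether writes are allowed. */
static inline toy_swp_entry_t toy_make_device_entry(uint64_t pfn, bool write)
{
	return toy_swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE, pfn);
}

static inline bool toy_is_device_entry(toy_swp_entry_t e)
{
	unsigned type = toy_swp_type(e);

	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
}

static inline bool toy_is_write_device_entry(toy_swp_entry_t e)
{
	return toy_swp_type(e) == SWP_DEVICE_WRITE;
}

/* Downgrade to read-only while keeping the pfn, as done on fork/mprotect. */
static inline void toy_make_device_entry_read(toy_swp_entry_t *e)
{
	*e = toy_swp_entry(SWP_DEVICE, toy_swp_offset(*e));
}
```

With these assumed values MAX_SWAPFILES drops to 27 and the two device
types take the last two slots (30 and 31), which is why no real swap
file can collide with them.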


* [HMM v17 04/14] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 03/14] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v3 Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 05/14] mm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

It does not need much: just skip populating the kernel linear mapping
for ranges of un-addressable device memory (the range is picked so that
no physical memory resource overlaps it). All the logic is in shared
mm code.

Only x86-64 is supported, as this feature does not make much sense with
the constrained virtual address space of 32-bit architectures.
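
The decision described above can be sketched in user space as follows.
This is an illustration only: toy_arch_add_memory() and struct
toy_add_result are hypothetical names, the flag values are copied from
the memory_hotplug.h hunk earlier in the series, and the assumption that
any flag outside supported_flags is rejected with -EINVAL is inferred
from the visible context rather than confirmed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MEMORY_NORMAL               0
#define MEMORY_DEVICE               (1 << 0)
#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)

/* Hypothetical result type: error code plus whether a linear map was made. */
struct toy_add_result {
	int err;
	bool linear_mapped;
};

static struct toy_add_result toy_arch_add_memory(int flags)
{
	const int supported_flags = MEMORY_DEVICE |
				    MEMORY_DEVICE_ALLOW_MIGRATE |
				    MEMORY_DEVICE_UNADDRESSABLE;
	struct toy_add_result res = { 0, false };

	/* Assumed check: reject flags this architecture does not know. */
	if (flags & ~supported_flags) {
		res.err = -EINVAL;
		return res;
	}

	/*
	 * The CPU can never touch un-addressable memory, so only the
	 * struct pages are created and the linear mapping is skipped.
	 */
	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
		res.linear_mapped = true;

	return res;
}
```

The removal path mirrors this: a range that never got a linear mapping
must not have one torn down either.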

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/mm/init_64.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 442ac86..6e7f613 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -654,7 +654,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL,
@@ -669,7 +670,17 @@ int arch_add_memory(int nid, u64 start, u64 size, int flags)
 		return -EINVAL;
 	}
 
-	init_memory_mapping(start, start + size);
+	/*
+	 * We get un-addressable memory when someone is adding a ZONE_DEVICE
+	 * to have struct page for device memory which is not accessible by
+	 * the CPU, so it is pointless to have a linear kernel mapping of such
+	 * memory.
+	 *
+	 * Core mm should make sure it never sets a pte pointing to such a fake
+	 * physical range.
+	 */
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
@@ -968,7 +979,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -989,7 +1001,9 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
 	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-	kernel_physical_mapping_remove(start, start + size);
+
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
-- 
2.4.3


* [HMM v17 05/14] mm/migrate: add new boolean copy flag to migratepage() callback
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 04/14] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3 Jérôme Glisse
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans, Jérôme Glisse

Allow migration without copy in case the destination page already has
the source page content. This is useful for new DMA-capable migration
where a device DMA engine is used to copy pages.

This feature needs a careful audit of filesystem code to make sure
that no one can write to the source page while it is unmapped and
locked. It should be safe for most filesystems, but as a precaution
return an error until support for device migration is added to them.
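
The split between copying page content and transferring page state can
be modelled in user space roughly as below. This is a sketch under
stated assumptions, not the kernel implementation: struct toy_page, its
two state bits and all toy_* functions are hypothetical stand-ins, and
the mapping move that migrate_page() performs first is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical stand-in for struct page: payload plus two state bits. */
struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	bool dirty;
	bool referenced;
};

/* Transfer state only; like the kernel helper, bits are set, never cleared. */
static void toy_migrate_page_states(struct toy_page *newpage,
				    const struct toy_page *page)
{
	if (page->dirty)
		newpage->dirty = true;
	if (page->referenced)
		newpage->referenced = true;
}

/* Copy content with the CPU, then transfer state. */
static void toy_migrate_page_copy(struct toy_page *newpage,
				  const struct toy_page *page)
{
	memcpy(newpage->data, page->data, TOY_PAGE_SIZE);
	toy_migrate_page_states(newpage, page);
}

/*
 * copy == false means the caller (for example a device DMA engine)
 * already filled newpage, so only the state is migrated.
 */
static int toy_migrate_page(struct toy_page *newpage,
			    const struct toy_page *page, bool copy)
{
	if (copy)
		toy_migrate_page_copy(newpage, page);
	else
		toy_migrate_page_states(newpage, page);
	return 0;
}
```

Existing callers keep their behaviour by passing true, which is what the
fallback and anonymous-page paths do in this patch.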

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/staging/lustre/lustre/llite/rw26.c |  8 +++--
 fs/aio.c                                   |  7 +++-
 fs/btrfs/disk-io.c                         | 11 ++++--
 fs/hugetlbfs/inode.c                       |  9 +++--
 fs/nfs/internal.h                          |  5 +--
 fs/nfs/write.c                             |  9 +++--
 fs/ubifs/file.c                            |  8 ++++-
 include/linux/balloon_compaction.h         |  3 +-
 include/linux/fs.h                         | 13 ++++---
 include/linux/migrate.h                    |  7 ++--
 mm/balloon_compaction.c                    |  2 +-
 mm/migrate.c                               | 56 +++++++++++++++++++-----------
 mm/zsmalloc.c                              | 12 ++++++-
 13 files changed, 106 insertions(+), 44 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index 26f3a37..5a225ca 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -43,6 +43,7 @@
 #include <linux/uaccess.h>
 
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
@@ -635,9 +636,12 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
 #ifdef CONFIG_MIGRATION
 static int ll_migratepage(struct address_space *mapping,
 			  struct page *newpage, struct page *page,
-			  enum migrate_mode mode
-		)
+			  enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/* Always fail page migration until we have a proper implementation */
 	return -EIO;
 }
diff --git a/fs/aio.c b/fs/aio.c
index 428484f..30cf06c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
 #include <linux/blkdev.h>
 #include <linux/compat.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
@@ -366,13 +367,17 @@ static const struct file_operations aio_ring_fops = {
 
 #if IS_ENABLED(CONFIG_MIGRATION)
 static int aio_migratepage(struct address_space *mapping, struct page *new,
-			struct page *old, enum migrate_mode mode)
+			   struct page *old, enum migrate_mode mode, bool copy)
 {
 	struct kioctx *ctx;
 	unsigned long flags;
 	pgoff_t idx;
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(new))
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a57f99..6ccd3c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ratelimit.h>
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
@@ -1046,9 +1047,13 @@ static int btree_submit_bio_hook(struct inode *inode, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page,
-			enum migrate_mode mode)
+			     struct page *newpage, struct page *page,
+			     enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * we can't safely write a btree page from here,
 	 * we haven't done the locking hook
@@ -1062,7 +1067,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4fb7b10..b52dd44 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -35,6 +35,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/uio.h>
 
 #include <asm/uaccess.h>
@@ -842,11 +843,15 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page,
-				enum migrate_mode mode)
+				  struct page *newpage, struct page *page,
+				  enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 80bcc0b..12d9d8d 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -535,8 +535,9 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct pnfs_ds_commit_info *cinfo)
 #endif
 
 #ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, enum migrate_mode);
+extern int nfs_migrate_page(struct address_space *mapping,
+			    struct page *newpage, struct page *page,
+			    enum migrate_mode, bool copy);
 #endif
 
 static inline int
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5321183..d7130a5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
 #include <linux/writeback.h>
 #include <linux/swap.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
@@ -2023,8 +2024,12 @@ int nfs_wb_single_page(struct inode *inode, struct page *page, bool launder)
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		     struct page *page, enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
 	 * an in-progress read or write request. Don't try to migrate it.
@@ -2039,7 +2044,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 	if (!nfs_fscache_release_page(page, GFP_KERNEL))
 		return -EBUSY;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index b4fbeef..f625cac 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -53,6 +53,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 static int read_block(struct inode *inode, void *addr, unsigned int block,
 		      struct ubifs_data_node *dn)
@@ -1455,10 +1456,15 @@ static int ubifs_set_page_dirty(struct page *page)
 
 #ifdef CONFIG_MIGRATION
 static int ubifs_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..27cf3e3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -85,7 +85,8 @@ extern bool balloon_page_isolate(struct page *page,
 extern void balloon_page_putback(struct page *page);
 extern int balloon_page_migrate(struct address_space *mapping,
 				struct page *newpage,
-				struct page *page, enum migrate_mode mode);
+				struct page *page, enum migrate_mode mode,
+				bool copy);
 
 /*
  * balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f63d44..431f0d3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -398,8 +398,9 @@ struct address_space_operations {
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
 	 */
-	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+	int (*migratepage)(struct address_space *mapping,
+			   struct page *newpage, struct page *page,
+			   enum migrate_mode, bool copy);
 	bool (*isolate_page)(struct page *, isolate_mode_t);
 	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
@@ -3010,9 +3011,11 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
 extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *,
-				enum migrate_mode);
+extern int buffer_migrate_page(struct address_space *mapping,
+			       struct page *newpage,
+			       struct page *page,
+			       enum migrate_mode,
+			       bool copy);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..37b77ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,8 +33,11 @@ extern char *migrate_reason_names[MR_TYPES];
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+extern int migrate_page(struct address_space *mapping,
+			struct page *newpage,
+			struct page *page,
+			enum migrate_mode,
+			bool copy);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
 extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..ed5cacb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -135,7 +135,7 @@ void balloon_page_putback(struct page *page)
 /* move_to_new_page() counterpart for a ballooned page */
 int balloon_page_migrate(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		enum migrate_mode mode, bool copy)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 0ed24b1..567674d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -613,18 +613,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	}
 }
 
-/*
- * Copy the page to its new location
- */
-void migrate_page_copy(struct page *newpage, struct page *page)
+static void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -678,6 +670,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -691,8 +696,8 @@ EXPORT_SYMBOL(migrate_page_copy);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		 struct page *newpage, struct page *page,
+		 enum migrate_mode mode, bool copy)
 {
 	int rc;
 
@@ -703,7 +708,11 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
+
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -715,13 +724,14 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			struct page *newpage, struct page *page,
+			enum migrate_mode mode, bool copy)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, mode);
+		return migrate_page(mapping, newpage, page, mode, copy);
 
 	head = page_buffers(page);
 
@@ -753,12 +763,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -813,7 +826,8 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, enum migrate_mode mode)
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
@@ -830,7 +844,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, true);
 }
 
 /*
@@ -858,7 +872,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	if (likely(is_lru)) {
 		if (!mapping)
-			rc = migrate_page(mapping, newpage, page, mode);
+			rc = migrate_page(mapping, newpage, page, mode, true);
 		else if (mapping->a_ops->migratepage)
 			/*
 			 * Most pages have a mapping and most filesystems
@@ -868,7 +882,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 			 * for page migration.
 			 */
 			rc = mapping->a_ops->migratepage(mapping, newpage,
-							page, mode);
+							page, mode, true);
 		else
 			rc = fallback_migrate_page(mapping, newpage,
 							page, mode);
@@ -885,7 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		}
 
 		rc = mapping->a_ops->migratepage(mapping, newpage,
-						page, mode);
+						page, mode, true);
 		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
 			!PageIsolated(page));
 	}
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b0bc023..bf73222 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -52,6 +52,7 @@
 #include <linux/zpool.h>
 #include <linux/mount.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/pagemap.h>
 
 #define ZSPAGE_MAGIC	0x58
@@ -2015,7 +2016,7 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 }
 
 int zs_page_migrate(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		    struct page *page, enum migrate_mode mode, bool copy)
 {
 	struct zs_pool *pool;
 	struct size_class *class;
@@ -2033,6 +2034,15 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
+	/*
+	 * Offloading the copy operation for a zspage requires special care
+	 * because of locking, so for now only regular migration is supported.
+	 * I do not expect we will ever want to support offloaded copy here.
+	 * See migrate.h for details on migrate_vma() and offloaded copy.
+	 */
+	if (!copy || !is_addressable_page(newpage))
+		return -EINVAL;
+
 	zspage = get_zspage(page);
 
 	/* Concurrent compactor cannot migrate any subpage in zspage */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 05/14] mm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-02-22 15:31   ` Reza Arbab
  2017-01-27 22:52 ` [HMM v17 07/14] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the
memory backing a range of virtual addresses of a process to different
memory (which can be allocated through a special allocator). It
differs from NUMA migration by working on a range of virtual addresses
and thus by doing migration in chunks that can be large enough to use
a DMA engine or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, an IBM platform with a CAPI bus can make use of this
feature to migrate between regular memory and CAPI device memory. New
CPU architectures with a pool of high performance memory that is not
managed as a cache but presented as regular memory (while being faster
and with lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs. NVidia plans to use HMM
for that.

Changes since v2:
  - dropped HMM prefix and HMM specific code
Changes since v1:
  - typo fixes
  - split out early unmap optimization for pages with a single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
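Not part of the patch: a minimal userspace sketch of how the src and dst
array entries pack a pfn together with the MIGRATE_PFN_* flags, for anyone
who wants to experiment with the encoding outside the kernel. The DEMO_*
names, the fixed 64-bit entry, the 4 KiB page size and the 2 MiB PMD size
are assumptions made for illustration; the kernel code uses unsigned long,
PAGE_SHIFT and PMD_SIZE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; the DEMO_* names are hypothetical, not kernel API. */
#define DEMO_PAGE_SHIFT		12	/* assumed 4 KiB pages */
#define DEMO_PMD_SHIFT		21	/* assumed 2 MiB PMD mappings */
#define DEMO_PFN_VALID		(UINT64_C(1) << 63)
#define DEMO_PFN_MIGRATE	(UINT64_C(1) << 62)
#define DEMO_PFN_HUGE		(UINT64_C(1) << 61)
#define DEMO_PFN_LOCKED		(UINT64_C(1) << 60)
#define DEMO_PFN_WRITE		(UINT64_C(1) << 59)
#define DEMO_PFN_MASK		((UINT64_C(1) << (64 - DEMO_PAGE_SHIFT)) - 1)

/* Pack a page frame number and its flags into one array entry. */
static inline uint64_t demo_mpfn_pack(uint64_t pfn, uint64_t flags)
{
	assert((pfn & ~DEMO_PFN_MASK) == 0);	/* pfn must stay below the flag bits */
	return pfn | flags;
}

/* Like migrate_pfn_to_page(): only entries with VALID set carry a pfn. */
static inline bool demo_mpfn_unpack(uint64_t mpfn, uint64_t *pfn)
{
	if (!(mpfn & DEMO_PFN_VALID))
		return false;
	*pfn = mpfn & DEMO_PFN_MASK;
	return true;
}

/* Like migrate_pfn_size(): number of bytes covered by the entry. */
static inline uint64_t demo_mpfn_size(uint64_t mpfn)
{
	return (mpfn & DEMO_PFN_HUGE) ? UINT64_C(1) << DEMO_PMD_SHIFT
				      : UINT64_C(1) << DEMO_PAGE_SHIFT;
}
```

An entry built with demo_mpfn_pack(0x1234, DEMO_PFN_VALID | DEMO_PFN_MIGRATE)
unpacks back to pfn 0x1234, while an entry without DEMO_PFN_VALID (for
example a hole marked only with DEMO_PFN_HUGE) carries no pfn at all.
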
 include/linux/migrate.h |  74 ++++++++
 mm/migrate.c            | 449 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 523 insertions(+)
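
Not part of the patch: a userspace model of the callback contract described
in the migrate_vma_ops comment. alloc_and_copy() fills dst[] for the entries
it is willing to migrate and leaves the others at 0; finalize_and_map() then
looks at MIGRATE_PFN_MIGRATE in src[]. struct demo_dev, the bump allocator
and every DEMO_/demo_ name are hypothetical, and no real page locking or
copy happens here.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; every DEMO_/demo_ name is hypothetical. */
#define DEMO_PFN_VALID		(UINT64_C(1) << 63)
#define DEMO_PFN_MIGRATE	(UINT64_C(1) << 62)
#define DEMO_PFN_LOCKED		(UINT64_C(1) << 60)

/* Stand-in for driver state: a bump allocator over a range of device pfns. */
struct demo_dev {
	uint64_t next_pfn;	/* next free device pfn */
	uint64_t end_pfn;	/* one past the last usable device pfn */
	size_t mapped;		/* entries mapped by demo_finalize_and_map() */
};

/* Models alloc_and_copy(): pick a destination for each migratable entry. */
static inline void demo_alloc_and_copy(const uint64_t *src, uint64_t *dst,
				       size_t n, struct demo_dev *dev)
{
	for (size_t i = 0; i < n; i++) {
		dst[i] = 0;	/* 0 tells the core to leave this entry alone */
		if (!(src[i] & DEMO_PFN_MIGRATE))
			continue;	/* not collected, or already given up on */
		if (dev->next_pfn == dev->end_pfn)
			continue;	/* out of device memory: skip the entry */
		/* A real driver would lock the new page and copy the data here. */
		dst[i] = dev->next_pfn++ | DEMO_PFN_VALID | DEMO_PFN_LOCKED;
	}
}

/* Models finalize_and_map(): act only on entries that really migrated. */
static inline void demo_finalize_and_map(const uint64_t *src,
					 const uint64_t *dst, size_t n,
					 struct demo_dev *dev)
{
	for (size_t i = 0; i < n; i++) {
		/*
		 * A skipped entry (dst == 0) appears to keep MIGRATE set in
		 * src in this version, so the model checks dst as well.
		 */
		if ((src[i] & DEMO_PFN_MIGRATE) && (dst[i] & DEMO_PFN_VALID))
			dev->mapped++;	/* e.g. update the device page table */
	}
}
```

With a pool of two device pages and three migratable source entries, the
third entry is left with dst == 0 and is simply remapped to its original
page when migrate_vma() finishes.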

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 37b77ba..cd56e41 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -122,4 +122,78 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 }
 #endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/
 
+
+#define MIGRATE_PFN_VALID	(1UL << (BITS_PER_LONG_LONG - 1))
+#define MIGRATE_PFN_MIGRATE	(1UL << (BITS_PER_LONG_LONG - 2))
+#define MIGRATE_PFN_HUGE	(1UL << (BITS_PER_LONG_LONG - 3))
+#define MIGRATE_PFN_LOCKED	(1UL << (BITS_PER_LONG_LONG - 4))
+#define MIGRATE_PFN_WRITE	(1UL << (BITS_PER_LONG_LONG - 5))
+#define MIGRATE_PFN_ZERO	(1UL << (BITS_PER_LONG_LONG - 6))
+#define MIGRATE_PFN_MASK	((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - 1)
+
+static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
+{
+	if (!(mpfn & MIGRATE_PFN_VALID))
+		return NULL;
+	return pfn_to_page(mpfn & MIGRATE_PFN_MASK);
+}
+
+static inline unsigned long migrate_pfn_size(unsigned long mpfn)
+{
+	return mpfn & MIGRATE_PFN_HUGE ? PMD_SIZE : PAGE_SIZE;
+}
+
+/*
+ * struct migrate_vma_ops - migrate operation callback
+ *
+ * @alloc_and_copy: allocate destination memory and copy source to it
+ * @finalize_and_map: allow caller to inspect successfully migrated pages
+ *
+ * migrate_vma() allows memory migration to use a DMA engine to perform the
+ * copy from source to destination memory. It also allows the caller to use
+ * its own memory allocator for destination memory.
+ *
+ * Note that in alloc_and_copy() the device driver can decide not to migrate
+ * some of the entries by simply setting the corresponding dst entry to 0.
+ *
+ * Destination pages must be locked and MIGRATE_PFN_LOCKED set in the
+ * corresponding entry of the dst array. It is expected that allocated pages
+ * will have an elevated refcount and that a put_page() will free the page.
+ *
+ * Device drivers might want to allocate with an extra refcount if they want to
+ * control deallocation of failed migration inside finalize_and_map() callback.
+ *
+ * The finalize_and_map() callback must use the MIGRATE_PFN_MIGRATE flag to
+ * determine which pages have been successfully migrated (it is set in the src
+ * array for each entry that has been successfully migrated).
+ *
+ * For migration from device memory to system memory the device driver must set
+ * the dst entry to MIGRATE_PFN_ERROR for any entry it can not migrate back due
+ * to an unrecoverable fatal hardware failure. Such a failure will trigger a
+ * SIGBUS for the process trying to access that memory.
+ */
+struct migrate_vma_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const unsigned long *src,
+			       unsigned long *dst,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const unsigned long *src,
+				 unsigned long *dst,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long mentries,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private);
+
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 567674d..150fc4d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -412,6 +412,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2078,3 +2086,444 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+struct migrate_vma {
+	struct vm_area_struct	*vma;
+	unsigned long		*dst;
+	unsigned long		*src;
+	unsigned long		cpages;
+	unsigned long		npages;
+	unsigned long		mpages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static inline int migrate_vma_array_full(struct migrate_vma *migrate)
+{
+	return migrate->npages >= migrate->mpages ? -ENOSPC : 0;
+}
+
+static int migrate_vma_collect_hole(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr, next;
+
+	for (addr = start & PAGE_MASK; addr < end; addr = next) {
+		unsigned long npages, i;
+		int ret;
+
+		next = pmd_addr_end(addr, end);
+		npages = (next - addr) >> PAGE_SHIFT;
+		if (npages == (PMD_SIZE >> PAGE_SHIFT)) {
+			migrate->src[migrate->npages++] = MIGRATE_PFN_HUGE;
+			ret = migrate_vma_array_full(migrate);
+			if (ret)
+				return ret;
+		} else {
+			for (i = 0; i < npages; ++i) {
+				migrate->src[migrate->npages++] = 0;
+				ret = migrate_vma_array_full(migrate);
+				if (ret)
+					return ret;
+			}
+		}
+	}
+
+	return 0;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
+		/* FIXME support THP */
+		return migrate_vma_collect_hole(start, end, walk);
+	}
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	for (; addr < end; addr += PAGE_SIZE, ptep++) {
+		unsigned long flags, pfn;
+		struct page *page;
+		pte_t pte;
+		int ret;
+
+		pte = *ptep;
+		pfn = pte_pfn(pte);
+
+		if (!pte_present(pte)) {
+			flags = pfn = 0;
+			goto next;
+		}
+
+		/* FIXME support THP */
+		page = vm_normal_page(migrate->vma, addr, pte);
+		if (!page || !page->mapping || PageTransCompound(page)) {
+			flags = pfn = 0;
+			goto next;
+		}
+
+		/*
+		 * By getting a reference on the page we pin it and block any
+		 * kind of migration. Side effect: this "freezes" the pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non-device pages (device pages are not on the lru and so
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+		migrate->cpages++;
+		flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
+		flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+
+next:
+		migrate->src[migrate->npages++] = pfn | flags;
+		ret = migrate_vma_array_full(migrate);
+		if (ret) {
+			pte_unmap_unlock(ptep, ptl);
+			return ret;
+		}
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * migrate_vma_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This walks the CPU page table and, for each virtual address backed by a
+ * valid page, updates the src array and takes a reference on the page in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void migrate_vma_collect(struct migrate_vma *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = migrate_vma_collect_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = migrate_vma_collect_hole;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+}
+
+/*
+ * migrate_vma_check_page() - check if page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages can not be migrated. Same test as migrate_page_move_mapping()
+ * except that here we allow migration of ZONE_DEVICE pages.
+ */
+static bool migrate_vma_check_page(struct page *page)
+{
+	/*
+	 * One extra ref because the caller holds an extra reference, either
+	 * from isolate_lru_page() for a regular page or from
+	 * migrate_vma_collect() for a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge pages); they are a bit more
+	 * complex to check than regular pages because they can be mapped with
+	 * a pmd or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by migrate_vma_collect(). Once a
+ * page is locked it is isolated from the lru (for non-device pages). Finally
+ * the ref taken by migrate_vma_collect() is dropped, as a locked page can not
+ * be migrated by a concurrent kernel thread.
+ */
+static void migrate_vma_prepare(struct migrate_vma *migrate)
+{
+	unsigned long addr = migrate->start, i = 0, size;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (; i < migrate->npages && migrate->cpages; i++, addr += size) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec */
+			lru_add_drain_all();
+			allow_drain = false;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+			put_page(page);
+			continue;
+		}
+
+		/* Drop the reference we took in collect */
+		put_page(page);
+
+		if (!migrate_vma_check_page(page)) {
+			migrate->src[i] = 0;
+			unlock_page(page);
+			migrate->cpages--;
+
+			putback_lru_page(page);
+		}
+	}
+}
+
+/*
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace page mapping (CPU page table pte) with a special migration pte entry
+ * and check again if the page has been pinned. Pinned pages are restored
+ * because we can not migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy content of original page over to new page.
+ */
+static void migrate_vma_unmap(struct migrate_vma *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	unsigned long addr = migrate->start, i = 0, restore = 0, size;
+
+	for (; addr < migrate->end && migrate->cpages; addr += size, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
+		}
+	}
+
+	for (addr = migrate->start, i = 0; addr < migrate->end && restore; addr += size, i++) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		putback_lru_page(page);
+	}
+}
+
+/*
+ * migrate_vma_pages() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from source struct page to destination
+ * struct page. This effectively finishes the migration from source page to the
+ * destination page.
+ */
+static void migrate_vma_pages(struct migrate_vma *migrate)
+{
+	unsigned long addr = migrate->start, i = 0, size;
+
+	for (; addr < migrate->end; addr += size, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		struct address_space *mapping;
+		int r;
+
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+	}
+}
+
+/*
+ * migrate_vma_finalize() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the
+ * extra ref for device pages.
+ */
+static void migrate_vma_finalize(struct migrate_vma *migrate)
+{
+	unsigned long addr = migrate->start, i = 0, size;
+
+	for (; addr < migrate->end && migrate->cpages; addr += size, i++) {
+		struct page *newpage = migrate_pfn_to_page(migrate->dst[i]);
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page)
+			continue;
+		if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE) || !newpage) {
+			if (newpage)
+				put_page(newpage);
+			newpage = page;
+		}
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->cpages--;
+
+		putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * migrate_vma() - migrate a range of memory inside vma using accel copy
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @mentries: maximum number of entry in src or dst pfns array
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @src: array of unsigned long entries containing source pfns
+ * @dst: array of unsigned long entries containing destination pfns
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, error code otherwise
+ *
+ * This will try to migrate a range of memory using callbacks to allocate and
+ * copy memory from source to destination. This function will first collect,
+ * lock and unmap pages in the range and then call alloc_and_copy() callback
+ * for the device driver to allocate destination memory and copy from source.
+ *
+ * Then it will proceed and try to effectively migrate the pages (struct page
+ * metadata), a step that can fail for various reasons. Before updating CPU
+ * page table it will call the finalize_and_map() callback so that the device
+ * driver can inspect what has been successfully migrated and update its own
+ * page table (this latter aspect is not mandatory and only makes sense for
+ * some users of this API).
+ *
+ * Finally the function updates the CPU page table and unlocks the pages before
+ * returning 0.
+ *
+ * It will return an error code only if one of the arguments is invalid.
+ */
+int migrate_vma(const struct migrate_vma_ops *ops,
+		struct vm_area_struct *vma,
+		unsigned long mentries,
+		unsigned long start,
+		unsigned long end,
+		unsigned long *src,
+		unsigned long *dst,
+		void *private)
+{
+	struct migrate_vma migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || !ops || !src || !dst || start >= end)
+		return -EINVAL;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	memset(src, 0, sizeof(*src) * ((end - start) >> PAGE_SHIFT));
+	migrate.src = src;
+	migrate.dst = dst;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.cpages = 0;
+	migrate.mpages = mentries;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	migrate_vma_collect(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Lock and isolate page */
+	migrate_vma_prepare(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/* Unmap pages */
+	migrate_vma_unmap(&migrate);
+	if (!migrate.cpages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped, and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in migrate_vma_pages() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src, dst, start, end, private);
+
+	/* This does the real migration of struct page */
+	migrate_vma_pages(&migrate);
+
+	ops->finalize_and_map(vma, src, dst, start, end, private);
+
+	/* Unlock and remap pages */
+	migrate_vma_finalize(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(migrate_vma);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v17 07/14] mm/migrate: migrate_vma() unmap page from vma while collecting pages
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3 Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 08/14] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

The common case for migration of a virtual address range is that pages
are mapped only once, inside the vma in which the migration is taking
place. Because we already walk the CPU page table for that range, we can
directly do the unmap there and set up the special migration swap entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 93 insertions(+), 15 deletions(-)
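
Not part of the patch: a small userspace model of how migrate_vma_prepare()
gives up on a page that can not be isolated or turns out to be pinned,
depending on whether migrate_vma_collect() already locked and unmapped it.
The DEMO_* flags and demo_prepare_give_up() are hypothetical names; refcount
and lru handling are left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; the DEMO_/demo_ names are hypothetical. */
#define DEMO_PFN_VALID		(UINT64_C(1) << 63)
#define DEMO_PFN_MIGRATE	(UINT64_C(1) << 62)
#define DEMO_PFN_LOCKED		(UINT64_C(1) << 60)

/*
 * entry points at a src[] entry exactly as the collect step left it.
 * Returns true when the CPU pte has to be restored by the second loop.
 */
static inline bool demo_prepare_give_up(uint64_t *entry)
{
	/* LOCKED set by collect means the pte is already a migration entry. */
	bool remap = (*entry & DEMO_PFN_LOCKED) != 0;

	if (remap) {
		/* Keep the pfn so the restore loop can still find the page. */
		*entry &= ~DEMO_PFN_MIGRATE;
		return true;
	}

	/* Never unmapped: nothing to restore, just forget the entry. */
	*entry = 0;
	return false;
}
```

The point of the two paths is that an entry handled by the trylock fast path
can not simply be zeroed: its pte has already been replaced, so the entry has
to survive until the restore loop has put the original mapping back.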

diff --git a/mm/migrate.c b/mm/migrate.c
index 150fc4d..d78c0e7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2142,9 +2142,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 {
 	struct migrate_vma *migrate = walk->private;
 	struct mm_struct *mm = walk->vma->vm_mm;
-	unsigned long addr = start;
+	unsigned long addr = start, unmaped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	int ret = 0;
 
 	if (pmd_none(*pmdp) || pmd_trans_unstable(pmdp)) {
 		/* FIXME support THP */
@@ -2152,9 +2153,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	}
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
 		unsigned long flags, pfn;
 		struct page *page;
+		swp_entry_t entry;
 		pte_t pte;
 		int ret;
 
@@ -2186,17 +2190,50 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
 		flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
+		/*
+		 * Optimize for the common case where the page is only mapped
+		 * once in one process. If we can lock the page, then we can
+		 * safely set up the special migration page table entry now.
+		 */
+		if (trylock_page(page)) {
+			pte_t swp_pte;
+
+			flags |= MIGRATE_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, pte_write(pte));
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like a regular unmap: we remove the rmap and
+			 * drop the page refcount. The page won't be freed as we
+			 * took a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmaped++;
+		}
+
 next:
 		migrate->src[migrate->npages++] = pfn | flags;
 		ret = migrate_vma_array_full(migrate);
 		if (ret) {
-			pte_unmap_unlock(ptep, ptl);
-			return ret;
+			ptep++;
+			break;
 		}
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
-	return 0;
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmaped)
+		flush_tlb_range(walk->vma, start, end);
+
+	return ret;
 }
 
 /*
@@ -2220,7 +2257,13 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mm_walk.mm = migrate->vma->vm_mm;
 	mm_walk.private = migrate;
 
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
 	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
 }
 
 /*
@@ -2264,20 +2307,25 @@ static bool migrate_vma_check_page(struct page *page)
  */
 static void migrate_vma_prepare(struct migrate_vma *migrate)
 {
-	unsigned long addr = migrate->start, i = 0, size;
+	unsigned long addr = migrate->start, i = 0, size, restore = 0;
 	bool allow_drain = true;
 
 	lru_add_drain();
 
 	for (; i < migrate->npages && migrate->cpages; i++, addr += size) {
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		bool remap = true;
+
 		size = migrate_pfn_size(migrate->src[i]);
 
 		if (!page)
 			continue;
 
-		lock_page(page);
-		migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
+			remap = false;
+			lock_page(page);
+			migrate->src[i] |= MIGRATE_PFN_LOCKED;
+		}
 
 		if (!PageLRU(page) && allow_drain) {
 			/* Drain CPU's pagevec */
@@ -2286,10 +2334,16 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 		}
 
 		if (isolate_lru_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
-			put_page(page);
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+				put_page(page);
+			}
 			continue;
 		}
 
@@ -2297,13 +2351,37 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 		put_page(page);
 
 		if (!migrate_vma_check_page(page)) {
-			migrate->src[i] = 0;
-			unlock_page(page);
-			migrate->cpages--;
+			if (remap) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
 
-			putback_lru_page(page);
+				get_page(page);
+				putback_lru_page(page);
+			} else {
+				migrate->src[i] = 0;
+				unlock_page(page);
+				migrate->cpages--;
+
+				putback_lru_page(page);
+			}
 		}
 	}
+
+	for (i = 0, addr = migrate->start; i < migrate->npages && restore; i++, addr += size) {
+		struct page *page = migrate_pfn_to_page(migrate->src[i]);
+		size = migrate_pfn_size(migrate->src[i]);
+
+		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		remove_migration_pte(page, migrate->vma, addr, page);
+
+		migrate->src[i] = 0;
+		unlock_page(page);
+		put_page(page);
+		restore--;
+	}
 }
 
 /*
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v17 08/14] mm/hmm: heterogeneous memory management (HMM for short)
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 07/14] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by
all 3 of them.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 MAINTAINERS              |   7 +++
 include/linux/hmm.h      | 150 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |   4 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 |  82 ++++++++++++++++++++++++++
 7 files changed, 251 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 606c43e..f119a0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5762,6 +5762,13 @@ S:	Supported
 F:	drivers/scsi/hisi_sas/
 F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	linux-wireless@vger.kernel.org
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..f00d519
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,150 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * HMM provides 3 separate types of functionality:
+ *   - Mirroring: synchronize CPU page table and device page table
+ *   - Device memory: allocating struct page for device memory
+ *   - Migration: migrating regular memory to device memory
+ *
+ * Each can be used independently from the others.
+ *
+ *
+ * Mirroring:
+ *
+ * HMM provides helpers to mirror a process address space on a device. For this
+ * it provides several helpers to order device page table updates with respect
+ * to CPU page table updates. The requirement is that, for any given virtual
+ * address, the CPU and device page tables cannot point to different physical
+ * pages. It uses the mmu_notifier API behind the scenes.
+ *
+ * Device memory:
+ *
+ * HMM provides helpers to leverage device memory, whether it is addressable
+ * by the CPU like regular memory or not addressable at all. In both cases the
+ * device memory is associated with dedicated struct pages (which are allocated
+ * as for hotplug memory). Device memory management is the responsibility of the
+ * device driver. HMM only allocates and initializes the struct pages associated
+ * with the device memory by hotplugging a ZONE_DEVICE memory range.
+ *
+ * Allocating struct pages for device memory allows device memory to be used
+ * almost like regular memory. Unlike regular memory, it cannot be added to the
+ * lru, nor can any memory allocation use device memory directly. Device memory
+ * will only end up being used in a process if the device driver migrates some
+ * of the process memory from regular memory to device memory.
+ *
+ *
+ * Migration:
+ *
+ * The existing memory migration mechanism (mm/migrate.c) does not allow using
+ * anything other than the CPU to copy from source to destination memory.
+ * Moreover, existing code is not tailored to drive migration from process
+ * virtual addresses rather than from a list of pages. Finally, the migration
+ * flow does not allow for graceful failure at different steps of the process.
+ *
+ * HMM solves all of the above through a simple API:
+ *
+ *      hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private);
+ *
+ * The ops struct provides 2 callbacks. alloc_and_copy() allocates the
+ * destination memory and initializes it using source memory. Migration can fail
+ * after this step, and thus the last callback, finalize_and_map(), lets the
+ * device driver know which pages were successfully migrated and which were not.
+ *
+ * This can easily be used outside of HMM's intended use case.
+ *
+ *
+ * This header file contains all the API related to these 3 types of
+ * functionality; each function and struct is documented in the comments below.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table have the write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_to_pfn() - return pfn value stored in an hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8aced..4effdbf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -516,6 +517,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM needs to track a few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index cfee5ec..98a297f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -843,6 +844,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 0c33f46..9cdf361e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,10 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config HMM
+	bool
+	depends on MMU
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a..e4d9f48 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..e891fdd
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers but by other HMM
+ * components. It allocates an HMM struct if mm does not have one and
+ * initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be freed once the mm_struct goes away,
+	 * hence we should always have pre-allocated a new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	/*
+	 * We should not need to lock here as no one should be able to register
+	 * a new HMM while an mm is being destroyed. But just to be safe ...
+	 */
+	spin_lock(&mm->page_table_lock);
+	hmm = mm->hmm;
+	mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+	kfree(hmm);
+}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread
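
The hmm_pfn_t helpers in the patch above pack a pfn and its per-page flags
into one unsigned long: the flags sit in the low HMM_PFN_SHIFT bits and the
pfn is shifted above them. Below is a minimal user-space sketch of that
layout, under the assumption that only the two flags from this patch exist.
All demo_/DEMO_ names are hypothetical stand-ins rather than kernel symbols,
and the sketch leaves out hmm_pfn_to_page(), which needs a real struct page.

```c
#include <assert.h>

/*
 * User-space model of the patch's hmm_pfn_t. The demo_/DEMO_ names are
 * hypothetical; they only mirror the layout from include/linux/hmm.h.
 */
typedef unsigned long demo_hmm_pfn_t;

#define DEMO_PFN_VALID (1UL << 0)
#define DEMO_PFN_WRITE (1UL << 1)
#define DEMO_PFN_SHIFT 2

/* Pack a raw pfn: shift it above the flag bits and mark it valid. */
static inline demo_hmm_pfn_t demo_pfn_from_pfn(unsigned long pfn)
{
	return (pfn << DEMO_PFN_SHIFT) | DEMO_PFN_VALID;
}

/* Unpack: -1UL when the valid flag is missing, as hmm_pfn_to_pfn() does. */
static inline unsigned long demo_pfn_to_pfn(demo_hmm_pfn_t pfn)
{
	if (!(pfn & DEMO_PFN_VALID))
		return -1UL;
	return pfn >> DEMO_PFN_SHIFT;
}

/* Round-trip self-check; returns 0 when every property holds. */
static int demo_pfn_selfcheck(void)
{
	demo_hmm_pfn_t entry = demo_pfn_from_pfn(0x1234UL) | DEMO_PFN_WRITE;

	assert(demo_pfn_to_pfn(entry) == 0x1234UL); /* flags do not leak in */
	assert(entry & DEMO_PFN_WRITE);             /* flag survives packing */
	assert(demo_pfn_to_pfn(0) == -1UL);         /* no valid bit, no pfn */
	return 0;
}
```

Calling demo_pfn_selfcheck() from any test harness exercises the round trip.
Widening the flag field only requires raising the shift, which is what patch
10/14 does when it moves HMM_PFN_SHIFT from 2 to 8.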

* [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 08/14] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-03-13 22:10   ` Krishna Reddy
  2017-01-27 22:52 ` [HMM v17 10/14] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This is heterogeneous memory management (HMM) process address space
mirroring. In a nutshell, this provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page tables
synchronized (we assume that both device and CPU are cache coherent, as
PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, thus sparing each device driver from growing its own CPU
page table walker and its own CPU page table synchronization mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 101 ++++++++++++++++++++++++++++
 mm/Kconfig          |  15 +++++
 mm/hmm.c            | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 301 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f00d519..31e2c50 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
@@ -138,6 +139,106 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize device page table with CPU page table?
+ *
+ * Device drivers must always synchronize with CPU page table updates; for this
+ * they can either use the mmu_notifier API directly or use the hmm_mirror
+ * API. A device driver can decide to register one mirror per device per process
+ * or just one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *          int ret;
+ *          // Device driver specific initialization, allocation of das (which
+ *          // embeds an hmm_mirror struct), and setting das->mirror.ops.
+ *          ret = hmm_mirror_register(&das->mirror, mm);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *          // Other device driver specific initialization
+ *      }
+ *
+ * Device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(). The expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          hmm_mirror_unregister(&das->mirror);
+ *          // Other device driver specific cleanup and now das can be freed
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * gets callbacks through the update() operation (see hmm_mirror_ops struct).
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* update() - update virtual address range of memory
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @action: update's type (turn read only, unmap, ...)
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback is called when the CPU page table is updated; the device
+	 * driver must update the device page table according to the action.
+	 *
+	 * The device driver callback must wait until the device has fully updated
+	 * its view for the range. Note we plan to make this asynchronous in later
+	 * patches, so that multiple devices can schedule updates to their page
+	 * tables and once all devices have scheduled the update we wait for
+	 * them to propagate.
+	 */
+	void (*update)(struct hmm_mirror *mirror,
+		       enum hmm_update action,
+		       unsigned long start,
+		       unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * hmm_mirror struct with HMM. HMM will track the list of all mirrors for each
+ * mm_struct (or each process).
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+			       struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 9cdf361e..598c38a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -293,6 +293,21 @@ config HMM
 	bool
 	depends on MMU
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	select HMM
+	select MMU_NOTIFIER
+	help
+	  HMM mirror is a set of helpers to mirror CPU page table into a device
+	  page table. There are two sides. The first is keeping both page tables
+	  synchronized so that no virtual address can point to different pages
+	  (but one page table might lag, ie one might still point to a page while
+	  the other is pointing to nothing).
+
+	  The second side of the equation is replicating CPU page table content
+	  for a range of virtual addresses. This requires careful synchronization
+	  with CPU page table updates.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index e891fdd..b725c6d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting mirrors list
+ * @mirrors: list of mirrors for this mm
+ * @wait_queue: wait queue
+ * @sequence: we track update to CPU page table with a sequence number
+ * @mmu_notifier: mmu notifier to track update to CPU page table
+ * @notifier_count: number of currently active notifiers
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	atomic_t		sequence;
+	wait_queue_head_t	wait_queue;
+	struct mmu_notifier	mmu_notifier;
+	atomic_t		notifier_count;
 };
 
 /*
@@ -48,6 +61,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 		if (!hmm)
 			return NULL;
+		init_waitqueue_head(&hmm->wait_queue);
+		atomic_set(&hmm->notifier_count, 0);
+		INIT_LIST_HEAD(&hmm->mirrors);
+		atomic_set(&hmm->sequence, 0);
+		hmm->mmu_notifier.ops = NULL;
+		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
 		spin_lock(&mm->page_table_lock);
@@ -80,3 +99,169 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 	kfree(hmm);
 }
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	/*
+	 * Mirror being added or remove is a rare event so list traversal isn't
+	 * protected by a lock, we rely on simple rules. All list modifications
+	 * are done using list_add_rcu() and list_del_rcu() under a spinlock to
+	 * protect from concurrent addition or removal but not traversal.
+	 *
+	 * Because hmm_mirror_unregister() waits for all running invalidations to
+	 * complete (and thus all list traversals to finish), none of the mirror
+	 * structs can be freed from under us while traversing the list and thus
+	 * it is safe to dereference their list pointer even if they were just
+	 * removed.
+	 */
+	list_for_each_entry (mirror, &hmm->mirrors, list)
+		mirror->ops->update(mirror, action, start, end);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	/* Reverse order here because we are getting out of invalidation */
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+static int hmm_mirror_do_register(struct hmm_mirror *mirror,
+				  struct mm_struct *mm,
+				  const bool locked)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	/* Register mmu_notifier if not already, use mmap_sem for locking */
+	if (!mirror->hmm->mmu_notifier.ops) {
+		struct hmm *hmm = mirror->hmm;
+
+		if (!locked)
+			down_write(&mm->mmap_sem);
+		if (!hmm->mmu_notifier.ops) {
+			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+				hmm->mmu_notifier.ops = NULL;
+				up_write(&mm->mmap_sem);
+				return -ENOMEM;
+			}
+		}
+		if (!locked)
+			up_write(&mm->mmap_sem);
+	}
+
+	spin_lock(&mirror->hmm->lock);
+	list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
+	spin_unlock(&mirror->hmm->lock);
+
+	return 0;
+}
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space, a device driver must register
+ * an HMM mirror struct.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, false);
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_register_locked() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * Same as hmm_mirror_register() except that mmap_sem must be write locked !
+ */
+int hmm_mirror_register_locked(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, true);
+}
+EXPORT_SYMBOL(hmm_mirror_register_locked);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space and clean up.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&mirror->list);
+	spin_unlock(&hmm->lock);
+
+	/*
+	 * Wait for all active notifiers so that it is safe to traverse the
+	 * mirror list without any lock.
+	 */
+	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread
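
The mirror API in the patch above comes down to a per-mm list of mirrors
whose update() callback is invoked for every invalidated range. The
following single-threaded user-space sketch models only that dispatch
pattern. It deliberately omits the mmu_notifier hookup, the RCU list
primitives and the notifier_count wait, and every demo_ name is hypothetical
rather than part of the kernel API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_mirror;

/* Stand-in for hmm_mirror_ops: one callback per invalidated range. */
struct demo_mirror_ops {
	void (*update)(struct demo_mirror *mirror,
		       unsigned long start, unsigned long end);
};

/* Stand-in for hmm_mirror; a driver would embed it in its own struct. */
struct demo_mirror {
	const struct demo_mirror_ops *ops;
	struct demo_mirror *next;
	unsigned long invalidated;	/* demo bookkeeping: bytes invalidated */
};

/* Stand-in for the per-mm struct hmm: just the list of mirrors. */
struct demo_hmm {
	struct demo_mirror *mirrors;
};

/* Like hmm_mirror_do_register(): reject a mirror without ops, then link. */
static int demo_mirror_register(struct demo_hmm *hmm, struct demo_mirror *m)
{
	if (!hmm || !m || !m->ops)
		return -EINVAL;
	m->next = hmm->mirrors;
	hmm->mirrors = m;
	return 0;
}

/* Unlink the mirror; the real code also waits for running notifiers. */
static void demo_mirror_unregister(struct demo_hmm *hmm, struct demo_mirror *m)
{
	struct demo_mirror **pp;

	for (pp = &hmm->mirrors; *pp; pp = &(*pp)->next) {
		if (*pp == m) {
			*pp = m->next;
			m->next = NULL;
			return;
		}
	}
}

/* Like hmm_invalidate_range(): tell every registered mirror. */
static void demo_invalidate_range(struct demo_hmm *hmm,
				  unsigned long start, unsigned long end)
{
	struct demo_mirror *m;

	for (m = hmm->mirrors; m; m = m->next)
		m->ops->update(m, start, end);
}

/* Example driver callback: only records how much was invalidated. */
static void demo_driver_update(struct demo_mirror *m,
			       unsigned long start, unsigned long end)
{
	m->invalidated += end - start;
}

static const struct demo_mirror_ops demo_driver_ops = {
	.update = demo_driver_update,
};
```

A driver-side caller would set the ops pointer before registering, which is
also what the sanity check in hmm_mirror_do_register() expects. After
demo_mirror_unregister() returns, the mirror no longer sees invalidations.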

* [HMM v17 10/14] mm/hmm/mirror: helper to snapshot CPU page table
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 11/14] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to share
the same code with our page fault handler.
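
A snapshot taken by this helper stays tracked until the driver is done with
it: a CPU page table invalidation that overlaps the tracked range marks the
snapshot invalid and clears the affected pfns entries. The user-space sketch
below models that bookkeeping only. The demo_ names and the 4K page size are
assumptions, demo_range_done() merely stands in for the validity check, and
the overlap test here treats both range ends as exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_PAGE_SHIFT 12	/* assumed 4K pages */

/* Stand-in for struct hmm_range: [start, end) plus the snapshot array. */
struct demo_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	unsigned long *pfns;	/* one entry per page in the range */
	bool valid;		/* false once an invalidation hit the range */
};

/* Model of the per-range work done when [start, end) is invalidated. */
static void demo_range_invalidate(struct demo_range *r,
				  unsigned long start, unsigned long end)
{
	unsigned long addr, last, idx, npages;

	/* No overlap: the snapshot is untouched. */
	if (end <= r->start || start >= r->end)
		return;

	r->valid = false;
	/* Clamp the invalidation to the tracked range. */
	addr = start > r->start ? start : r->start;
	last = end < r->end ? end : r->end;
	idx = (addr - r->start) >> DEMO_PAGE_SHIFT;
	npages = (last - addr) >> DEMO_PAGE_SHIFT;
	/* Clear only the entries covered by the overlap. */
	memset(&r->pfns[idx], 0, sizeof(*r->pfns) * npages);
}

/* Hypothetical validity check: true if the snapshot can still be used. */
static bool demo_range_done(const struct demo_range *r)
{
	return r->valid;
}
```

A driver following this pattern would retry the snapshot whenever the
validity check fails, instead of programming stale entries into the device
page table.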

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  56 +++++++++++-
 mm/hmm.c            | 257 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 31e2c50..b5eafdc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,13 +83,28 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ: read permission set
  * HMM_PFN_WRITE: CPU page table have the write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special, ie result of
+ *      vm_insert_pfn() or vm_insert_page(), and thus should not be mirrored by
+ *      a device (the entry will never have HMM_PFN_VALID set and the pfn value
+ *      is undefined)
+ * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_DEVICE (1 << 5)
+#define HMM_PFN_SPECIAL (1 << 6)
+#define HMM_PFN_UNADDRESSABLE (1 << 7)
+#define HMM_PFN_SHIFT 8
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -236,6 +251,43 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
 			       struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it was filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table, call hmm_vma_get_pfns(), then take the device
+ * driver lock that serializes device page table updates and call
+ * hmm_vma_range_done() to check if the snapshot is still valid. The device
+ * driver page table update lock must also be used in the HMM mirror update()
+ * callback so that CPU page table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index b725c6d..0ef06df 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,10 +19,15 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
+
 /*
  * struct hmm - HMM per mm struct
  *
@@ -37,6 +42,7 @@
 struct hmm {
 	struct mm_struct	*mm;
 	spinlock_t		lock;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	atomic_t		sequence;
 	wait_queue_head_t	wait_queue;
@@ -66,6 +72,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		INIT_LIST_HEAD(&hmm->mirrors);
 		atomic_set(&hmm->sequence, 0);
 		hmm->mmu_notifier.ops = NULL;
+		INIT_LIST_HEAD(&hmm->ranges);
 		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
@@ -108,6 +115,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	rcu_read_unlock();
 
 	/*
 	 * Mirror being added or remove is a rare event so list traversal isn't
@@ -264,4 +287,238 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_empty(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_EMPTY;
+}
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static void hmm_vma_walk(struct vm_area_struct *vma,
+			 unsigned long start,
+			 unsigned long end,
+			 hmm_pfn_t *pfns)
+{
+	unsigned long addr, next;
+	hmm_pfn_t flag;
+
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	for (addr = start; addr < end; addr = next) {
+		unsigned long i = (addr - start) >> PAGE_SHIFT;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+		pmd_t pmd;
+
+		/*
+		 * We are accessing/faulting for a device from an unknown
+		 * thread that might be foreign to the mm we are faulting
+		 * against so do not call arch_vma_access_permitted() !
+		 */
+
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset(vma->vm_mm, addr);
+		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		if (pud_none(*pudp) || pud_bad(*pudp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd) || pmd_bad(pmd)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+			hmm_pfn_t flags = flag;
+
+			if (pmd_protnone(pmd)) {
+				hmm_pfns_clear(&pfns[i], addr, next);
+				continue;
+			}
+			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
+			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+			continue;
+		}
+
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			pte_t pte = *ptep;
+
+			pfns[i] = 0;
+
+			if (pte_none(pte)) {
+				pfns[i] = HMM_PFN_EMPTY;
+				continue;
+			}
+
+			entry = pte_to_swp_entry(pte);
+			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				continue;
+			}
+
+			if (pte_present(pte)) {
+				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
+				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			 */
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE;
+				pfns[i] |= HMM_PFN_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t provided by the caller and filled by the function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. Snapshot
+ * validity is tracked by the range struct, see hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized and tracks CPU page table updates only if the
+ * function returns success (0), in which case you must call
+ * hmm_vma_range_done() to stop CPU page table update tracking for the range.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have registered a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	hmm_vma_walk(vma, start, end, pfns);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being track
+ * Returns: false if range data have been invalidated, true otherwise
+ *
+ * Range struct is use to track update to CPU page table after call to
+ * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
+ * to data it gots from this function it calls hmm_vma_range_done() which stop
+ * the tracking.
+ *
+ * There is 2 way to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

* [HMM v17 11/14] mm/hmm/mirror: device page fault handler
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 10/14] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 12/14] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver. Unlike handle_mm_fault(),
it does not trigger migration of device memory back to system memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
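A minimal userspace sketch of the test hmm_vma_walk() uses to decide
whether an entry must be faulted. It assumes hypothetical bit positions
for HMM_PFN_READ and HMM_PFN_WRITE (the real values live in
include/linux/hmm.h and are not shown in this patch); fault_mask(),
needs_fault() and hmm_sketch_self_check() are illustrative names, not
kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hmm_pfn_t;

/* Assumed bit positions, for illustration only. */
#define HMM_PFN_READ  (UINT64_C(1) << 0)
#define HMM_PFN_WRITE (UINT64_C(1) << 1)

/* Mask built the same way hmm_vma_fault() builds its "fault" argument. */
static hmm_pfn_t fault_mask(bool write)
{
	return HMM_PFN_READ | (write ? HMM_PFN_WRITE : 0);
}

/* Same test as "(flags & fault) != fault" in the page table walk. */
static bool needs_fault(hmm_pfn_t entry, hmm_pfn_t fault)
{
	return (entry & fault) != fault;
}

/* Small self-check showing the intended behaviour. */
void hmm_sketch_self_check(void)
{
	/* Snapshot mode (fault == 0) never asks for a fault. */
	assert(!needs_fault(0, 0));
	/* A read-only entry is not enough for a write fault. */
	assert(needs_fault(HMM_PFN_READ, fault_mask(true)));
	/* A read-only entry satisfies a read fault. */
	assert(!needs_fault(HMM_PFN_READ, fault_mask(false)));
}
```

With fault == 0, which is what hmm_vma_get_pfns() passes, the test is
always false, so the same walk can serve both the snapshot path and the
fault path.
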
 include/linux/hmm.h |  26 +++++
 mm/hmm.c            | 269 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 267 insertions(+), 28 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b5eafdc..f19c2a0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -288,6 +288,32 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of a device driver. Unlike handle_mm_fault(), it will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array is
+ * updated with the fault result and the current snapshot of the CPU page table
+ * for the range.
+ *
+ * The mmap_sem must be held in read mode before entering. It might be dropped
+ * by the function if the block argument is false; when that happens the
+ * function returns -EAGAIN.
+ *
+ * The return value does not reflect whether the fault was successful for every
+ * single address, you need to inspect the hmm_pfn_t array to determine the
+ * fault status of each address. Trying to fault inside an invalid vma will
+ * result in -EINVAL.
+ *
+ * See function description in mm/hmm.c for documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 0ef06df..a397d45 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -288,6 +288,15 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+
+static void hmm_pfns_error(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_ERROR;
+}
+
 static void hmm_pfns_empty(hmm_pfn_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
@@ -304,10 +313,43 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
-static void hmm_vma_walk(struct vm_area_struct *vma,
-			 unsigned long start,
-			 unsigned long end,
-			 hmm_pfn_t *pfns)
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	unsigned long npfns = (end - addr) >> PAGE_SHIFT;
+
+	memset(pfns, 0, sizeof(*pfns) * npfns);
+}
+
+static int hmm_vma_do_fault(struct vm_area_struct *vma,
+			    const hmm_pfn_t fault,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn,
+			    bool block)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	int r;
+
+	flags |= block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= (fault & HMM_PFN_WRITE) ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EAGAIN;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int hmm_vma_walk(struct vm_area_struct *vma,
+			const hmm_pfn_t fault,
+			unsigned long start,
+			unsigned long end,
+			hmm_pfn_t *pfns,
+			bool block)
 {
 	unsigned long addr, next;
 	hmm_pfn_t flag;
@@ -321,6 +363,7 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd_t *pmdp;
 		pte_t *ptep;
 		pmd_t pmd;
+		int ret;
 
 		/*
 		 * We are accessing/faulting for a device from an unknown
@@ -331,15 +374,37 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset(vma->vm_mm, addr);
 		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pudp = pud_alloc(vma->vm_mm, pgdp, addr);
+			if (!pudp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pud_addr_end(addr, end);
 		pudp = pud_offset(pgdp, addr);
 		if (pud_none(*pudp) || pud_bad(*pudp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pmdp = pmd_alloc(vma->vm_mm, pudp, addr);
+			if (!pmdp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pmd_addr_end(addr, end);
@@ -347,8 +412,24 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd = pmd_read_atomic(pmdp);
 		barrier();
 		if (pmd_none(pmd) || pmd_bad(pmd)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			/*
+			 * Use pte_alloc() instead of pte_alloc_map(), because we
+			 * can't run pte_offset_map() on the pmd if a huge pmd
+			 * could materialize from under us.
+			 */
+			if (unlikely(pte_alloc(vma->vm_mm, pmdp, addr))) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
+			pmd = *pmdp;
 		}
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
@@ -356,10 +437,14 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 
 			if (pmd_protnone(pmd)) {
 				hmm_pfns_clear(&pfns[i], addr, next);
+				if (fault)
+					goto fault;
 				continue;
 			}
 			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
 			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			if ((flags & fault) != fault)
+				goto fault;
 			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
 			continue;
@@ -370,41 +455,63 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t pte = *ptep;
 
-			pfns[i] = 0;
-
 			if (pte_none(pte)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
 				pfns[i] = HMM_PFN_EMPTY;
 				continue;
 			}
 
 			entry = pte_to_swp_entry(pte);
 			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
+				pfns[i] = 0;
 				continue;
 			}
 
 			if (pte_present(pte)) {
 				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
 				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-				continue;
-			}
-
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			*/
-			if (is_device_entry(entry)) {
+			} else if (is_device_entry(entry)) {
+				/* Do not fault device entry */
 				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_entry(entry))
 					pfns[i] |= HMM_PFN_WRITE;
 				pfns[i] |= HMM_PFN_DEVICE;
 				pfns[i] |= HMM_PFN_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry) && fault) {
+				migration_entry_wait(vma->vm_mm, pmdp, addr);
+				/* Start again for current address */
+				next = addr;
+				ptep++;
+				break;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
+			if ((fault & pfns[i]) != fault) {
+				pte_unmap(ptep);
+				goto fault;
+			}
 		}
 		pte_unmap(ptep - 1);
+		continue;
+
+fault:
+		ret = hmm_vma_do_fault(vma, fault, addr, &pfns[i], block);
+		if (ret)
+			return ret;
+		/* Start again for current address */
+		next = addr;
 	}
+
+	return 0;
 }
 
 /*
@@ -463,7 +570,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
-	hmm_vma_walk(vma, start, end, pfns);
+	hmm_vma_walk(vma, 0, start, end, pfns, false);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -474,14 +581,22 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  * @range: range being track
  * Returns: false if range data have been invalidated, true otherwise
  *
- * Range struct is use to track update to CPU page table after call to
- * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
- * to data it gots from this function it calls hmm_vma_range_done() which stop
- * the tracking.
+ * The range struct is used to track updates to the CPU page table after a call
+ * to either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is
+ * done using the data it got from those functions, or wants to lock updates to
+ * it, it must call hmm_vma_range_done() to stop CPU page table update tracking.
+ *
+ * Note that the device driver must still implement general CPU page table
+ * update tracking, either by using hmm_mirror (see hmm_mirror_register()) or by
+ * using the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * is to be used while trying to duplicate CPU page table content for a range of
+ * virtual addresses.
  *
  * There is 2 way to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -492,7 +607,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -521,4 +636,102 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some addresses in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entries with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem was dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * If faulting fails for a virtual address in the range, the function sets the
+ * hmm_pfn_t error flag in the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, range, start, end, pfns, write, block);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_done()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem);
+ *     return;
+ *   }
+ *   // Take device driver lock that serializes device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem);
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURNS SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	hmm_pfn_t fault = HMM_PFN_READ | (write ? HMM_PFN_WRITE : 0);
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have registered a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	ret = hmm_vma_walk(vma, fault, start, end, pfns, block);
+	if (ret)
+		hmm_vma_range_done(vma, range);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

* [HMM v17 12/14] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 11/14] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 13/14] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE Jérôme Glisse
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans, Jérôme Glisse

Allow unmapping and restoring the special swap entries of un-addressable
ZONE_DEVICE memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
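A minimal userspace sketch of how the collect step picks the source
flags for one page table entry, following the branches added to
migrate_vma_collect_pmd() in this patch. The bit positions of
MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE are assumptions (they are
defined elsewhere in the series), and enum pte_kind, collect_flags() and
migrate_sketch_self_check() are hypothetical names used only for
illustration. The THP and page->mapping checks that follow in the real
code are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_LONG_LONG 64

/* Assumed positions: defined elsewhere in the series, not in this patch. */
#define MIGRATE_PFN_VALID   (UINT64_C(1) << (BITS_PER_LONG_LONG - 1))
#define MIGRATE_PFN_MIGRATE (UINT64_C(1) << (BITS_PER_LONG_LONG - 2))
/* Positions as in the include/linux/migrate.h hunk of this patch. */
#define MIGRATE_PFN_WRITE   (UINT64_C(1) << (BITS_PER_LONG_LONG - 5))
#define MIGRATE_PFN_DEVICE  (UINT64_C(1) << (BITS_PER_LONG_LONG - 7))

/* Hypothetical classification of a page table entry. */
enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP, PTE_DEVICE };

static uint64_t collect_flags(enum pte_kind kind, bool writable,
			      bool allow_migrate)
{
	uint64_t flags;

	switch (kind) {
	case PTE_PRESENT:
		/* Regular present page. */
		flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
		break;
	case PTE_DEVICE:
		/* Un-addressable device entry, only if the device allows it. */
		if (!allow_migrate)
			return 0;
		flags = MIGRATE_PFN_VALID | MIGRATE_PFN_DEVICE |
			MIGRATE_PFN_MIGRATE;
		break;
	default:
		/* Empty entries and regular swap entries are skipped. */
		return 0;
	}
	if (writable)
		flags |= MIGRATE_PFN_WRITE;
	return flags;
}

/* Small self-check showing the intended behaviour. */
void migrate_sketch_self_check(void)
{
	assert(collect_flags(PTE_NONE, true, true) == 0);
	assert(collect_flags(PTE_SWAP, true, true) == 0);
	assert(collect_flags(PTE_DEVICE, false, false) == 0);
	assert(collect_flags(PTE_DEVICE, true, true) & MIGRATE_PFN_DEVICE);
}
```

The sketch uses 64-bit constants instead of 1UL only so that it also
builds on hosts where long is 32 bits.
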
 include/linux/migrate.h |   2 +
 mm/migrate.c            | 134 +++++++++++++++++++++++++++++++++++++-----------
 mm/rmap.c               |  47 +++++++++++++++++
 3 files changed, 153 insertions(+), 30 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index cd56e41..2d7904a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -129,6 +129,8 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #define MIGRATE_PFN_LOCKED	(1UL << (BITS_PER_LONG_LONG - 4))
 #define MIGRATE_PFN_WRITE	(1UL << (BITS_PER_LONG_LONG - 5))
 #define MIGRATE_PFN_ZERO	(1UL << (BITS_PER_LONG_LONG - 6))
+#define MIGRATE_PFN_DEVICE	(1UL << (BITS_PER_LONG_LONG - 7))
+#define MIGRATE_PFN_ERROR	(1UL << (BITS_PER_LONG_LONG - 8))
 #define MIGRATE_PFN_MASK	((1UL << (BITS_PER_LONG_LONG - PAGE_SHIFT)) - 1)
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
diff --git a/mm/migrate.c b/mm/migrate.c
index d78c0e7..bc14b8e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -40,6 +40,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -248,7 +249,15 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		pte = arch_make_huge_pte(pte, vma, new, 0);
 	}
 #endif
-	flush_dcache_page(new);
+
+	if (unlikely(is_zone_device_page(new)) && !is_addressable_page(new)) {
+		entry = make_device_entry(new, pte_write(pte));
+		pte = swp_entry_to_pte(entry);
+		if (pte_swp_soft_dirty(*ptep))
+			pte = pte_mksoft_dirty(pte);
+	} else
+		flush_dcache_page(new);
+
 	set_pte_at(mm, addr, ptep, pte);
 
 	if (PageHuge(new)) {
@@ -2165,17 +2174,44 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		pte = *ptep;
 		pfn = pte_pfn(pte);
 
-		if (!pte_present(pte)) {
+		if (pte_none(pte)) {
 			flags = pfn = 0;
 			goto next;
 		}
 
+		if (!pte_present(pte)) {
+			flags = pfn = 0;
+
+			/*
+			 * Only care about un-addressable device page special
+			 * entries. Other special swap entries are not
+			 * migratable and we ignore regular swapped pages.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_entry(entry))
+				goto next;
+
+			page = device_entry_to_page(entry);
+			if (!dev_page_allow_migrate(page))
+				goto next;
+
+			flags = MIGRATE_PFN_VALID |
+				MIGRATE_PFN_DEVICE |
+				MIGRATE_PFN_MIGRATE;
+			if (is_write_device_entry(entry))
+				flags |= MIGRATE_PFN_WRITE;
+		} else {
+			page = vm_normal_page(migrate->vma, addr, pte);
+			flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
+			flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
+		}
+
 		/* FIXME support THP */
-		page = vm_normal_page(migrate->vma, addr, pte);
 		if (!page || !page->mapping || PageTransCompound(page)) {
 			flags = pfn = 0;
 			goto next;
 		}
+		pfn = page_to_pfn(page);
 
 		/*
 		 * By getting a reference on the page we pin it and blocks any
@@ -2187,8 +2223,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 */
 		get_page(page);
 		migrate->cpages++;
-		flags = MIGRATE_PFN_VALID | MIGRATE_PFN_MIGRATE;
-		flags |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 
 		/*
 		 * Optimize for common case where page is only map once in one
@@ -2290,6 +2324,13 @@ static bool migrate_vma_check_page(struct page *page)
 	if (PageCompound(page))
 		return false;
 
+	/* Pages from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		if (!dev_page_allow_migrate(page))
+			return false;
+		extra++;
+	}
+
 	if ((page_count(page) - extra) > page_mapcount(page))
 		return false;
 
@@ -2327,28 +2368,31 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			migrate->src[i] |= MIGRATE_PFN_LOCKED;
 		}
 
-		if (!PageLRU(page) && allow_drain) {
-			/* Drain CPU's pagevec */
-			lru_add_drain_all();
-			allow_drain = false;
-		}
+		/* ZONE_DEVICE pages are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
 
-		if (isolate_lru_page(page)) {
-			if (remap) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
-				restore++;
-			} else {
-				migrate->src[i] = 0;
-				unlock_page(page);
-				migrate->cpages--;
-				put_page(page);
+			if (isolate_lru_page(page)) {
+				if (remap) {
+					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+					migrate->cpages--;
+					restore++;
+				} else {
+					migrate->src[i] = 0;
+					unlock_page(page);
+					migrate->cpages--;
+					put_page(page);
+				}
+				continue;
 			}
-			continue;
-		}
 
-		/* Drop the reference we took in collect */
-		put_page(page);
+			/* Drop the reference we took in collect */
+			put_page(page);
+		}
 
 		if (!migrate_vma_check_page(page)) {
 			if (remap) {
@@ -2356,14 +2400,19 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 				migrate->cpages--;
 				restore++;
 
-				get_page(page);
-				putback_lru_page(page);
+				if (!is_zone_device_page(page)) {
+					get_page(page);
+					putback_lru_page(page);
+				}
 			} else {
 				migrate->src[i] = 0;
 				unlock_page(page);
 				migrate->cpages--;
 
-				putback_lru_page(page);
+				if (!is_zone_device_page(page))
+					putback_lru_page(page);
+				else
+					put_page(page);
 			}
 		}
 	}
@@ -2428,7 +2477,10 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 		unlock_page(page);
 		restore--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 	}
 }
 
@@ -2459,6 +2511,22 @@ static void migrate_vma_pages(struct migrate_vma *migrate)
 
 		mapping = page_mapping(page);
 
+		if (is_zone_device_page(newpage)) {
+			if (!dev_page_allow_migrate(newpage)) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				continue;
+			}
+
+			/*
+			 * For now only support private anonymous memory when
+			 * migrating to un-addressable device memory.
+			 */
+			if (mapping && !is_addressable_page(newpage)) {
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				continue;
+			}
+		}
+
 		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
 		if (r != MIGRATEPAGE_SUCCESS)
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
@@ -2497,11 +2565,17 @@ static void migrate_vma_finalize(struct migrate_vma *migrate)
 		unlock_page(page);
 		migrate->cpages--;
 
-		putback_lru_page(page);
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
 
 		if (newpage != page) {
 			unlock_page(newpage);
-			putback_lru_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
 		}
 	}
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 91619fd..c7b0b54 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -61,6 +61,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1454,6 +1455,52 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			goto out;
 	}
 
+	if ((flags & TTU_MIGRATION) && is_zone_device_page(page)) {
+		swp_entry_t entry;
+		pte_t swp_pte;
+		pmd_t *pmdp;
+
+		if (!dev_page_allow_migrate(page))
+			goto out;
+
+		pmdp = mm_find_pmd(mm, address);
+		if (!pmdp)
+			goto out;
+
+		pte = pte_offset_map_lock(mm, pmdp, address, &ptl);
+		if (!pte)
+			goto out;
+
+		pteval = ptep_get_and_clear(mm, address, pte);
+		if (pte_present(pteval) || pte_none(pteval)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		entry = pte_to_swp_entry(pteval);
+		if (!is_device_entry(entry)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		if (device_entry_to_page(entry) != page) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		entry = make_migration_entry(page, 0);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(*pte))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(mm, address, pte, swp_pte);
+		goto discard;
+	}
+
 	pte = page_check_address(page, mm, address, &ptl,
 				 PageTransCompound(page));
 	if (!pte)
-- 
2.4.3

* [HMM v17 13/14] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 12/14] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-27 22:52 ` [HMM v17 14/14] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2 Jérôme Glisse
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device drivers
to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
will find an unused physical address range and trigger memory hotplug for
it, which allocates and initializes struct pages for the device memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
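Memory hotplug works on section-sized units, so a driver may want to
know how a requested size rounds up before asking for device memory. A
minimal userspace sketch of that arithmetic, assuming 4 KiB pages and a
PA_SECTION_SHIFT of 27 (both values are assumptions and depend on the
architecture); section_align(), section_npages() and
devmem_sketch_self_check() are hypothetical helpers, not functions from
this patch, and whether hmm_devmem_add() rounds in exactly this way is
not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12	/* assumption: 4 KiB pages */
#define PA_SECTION_SHIFT 27	/* assumption: 128 MiB sections */
#define SECTION_SIZE     (UINT64_C(1) << PA_SECTION_SHIFT)

/* Round a size in bytes up to a whole number of sections. */
static uint64_t section_align(uint64_t size)
{
	return (size + SECTION_SIZE - 1) & ~(SECTION_SIZE - 1);
}

/* Number of struct pages needed to cover the rounded-up size. */
static uint64_t section_npages(uint64_t size)
{
	return section_align(size) >> PAGE_SHIFT;
}

/* Small self-check showing the intended behaviour. */
void devmem_sketch_self_check(void)
{
	/* Even one byte costs a full section. */
	assert(section_align(1) == SECTION_SIZE);
	/* An exact multiple is left unchanged. */
	assert(section_align(SECTION_SIZE) == SECTION_SIZE);
}
```

Under these assumptions one section covers 32768 pages, so small
requests still consume a full section's worth of struct pages.
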
 include/linux/hmm.h | 115 +++++++++++++++
 mm/Kconfig          |   7 +
 mm/hmm.c            | 402 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 524 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f19c2a0..2332f89 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/migrate.h>
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -317,6 +322,116 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the refcount on a page reaches 1 and it is no longer used
+ * @fault: called when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct device			*device;
+	struct dev_pagemap		pagemap;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+};
+
+/*
+ * To add (hotplug) device memory, HMM assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to be
+ * used by un-addressable device memory). It will reserve a physical range big
+ * enough and allocate struct pages for it.
+ *
+ * A device driver can wrap the hmm_devmem struct inside a private device driver
+ * struct. It must call hmm_devmem_remove() before the device goes away and
+ * before freeing the hmm_devmem struct memory.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size);
+void hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long mentry,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on an lru, there is an unsigned long a driver
+ * can use to store a per page field. This is just a simple helper to do that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 598c38a..98d5c4b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -308,6 +308,13 @@ config HMM_MIRROR
 	  range of virtual address. This require careful synchronization with
 	  CPU page table update.
 
+config HMM_DEVMEM
+	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+	select HMM
+	help
+	  HMM devmem provides helpers to leverage the new ZONE_DEVICE feature,
+	  sparing device drivers from replicating boilerplate code.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index a397d45..61a8c24 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/mmu_notifier.h>
 
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
 
 /*
  * struct hmm - HMM per mm struct
@@ -735,3 +740,400 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_ref_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+}
+
+static void hmm_devmem_ref_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_exit, data);
+}
+
+static void hmm_devmem_ref_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_kill(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, &hmm_devmem_ref_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+static DEFINE_MUTEX(hmm_devmem_lock);
+static RADIX_TREE(hmm_devmem_radix, GFP_KERNEL);
+/* Radix tree keyed by section number: phys >> PA_SECTION_SHIFT. */
+
+static void hmm_devmem_radix_release(struct resource *resource)
+{
+	resource_size_t key, align_start, align_size, align_end;
+
+	align_start = resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(resource->end + 1, SECTION_SIZE) - align_start;
+	align_end = align_start + align_size - 1;
+
+	mutex_lock(&hmm_devmem_lock);
+	for (key = align_start; key <= align_end; key += SECTION_SIZE)
+		radix_tree_delete(&hmm_devmem_radix, key >> PA_SECTION_SHIFT);
+	mutex_unlock(&hmm_devmem_lock);
+}
+
+static void hmm_devmem_release(struct device *dev, void *data)
+{
+	struct hmm_devmem *devmem = data;
+	resource_size_t align_start, align_size;
+	struct resource *resource = devmem->resource;
+
+	if (percpu_ref_tryget_live(&devmem->ref)) {
+		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
+		percpu_ref_put(&devmem->ref);
+	}
+
+	/* pages are dead and unused, undo the arch mapping */
+	align_start = resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(resource->end + 1, SECTION_SIZE) - align_start;
+	arch_remove_memory(align_start, align_size, devmem->pagemap.flags);
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+	hmm_devmem_radix_release(resource);
+}
+
+static struct hmm_devmem *hmm_devmem_find(resource_size_t phys)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return radix_tree_lookup(&hmm_devmem_radix, phys >> PA_SECTION_SHIFT);
+}
+
+static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
+{
+	resource_size_t key, align_start, align_size, align_end;
+	struct device *device = devmem->device;
+	pgprot_t pgprot = PAGE_KERNEL;
+	int ret, nid, is_ram;
+	unsigned long pfn;
+
+	align_start = devmem->resource->start & ~(SECTION_SIZE - 1);
+	align_size = ALIGN(devmem->resource->start +
+			   resource_size(devmem->resource),
+			   SECTION_SIZE) - align_start;
+
+	is_ram = region_intersects(align_start, align_size,
+				   IORESOURCE_SYSTEM_RAM,
+				   IORES_DESC_NONE);
+	if (is_ram == REGION_MIXED) {
+		WARN_ONCE(1, "%s attempted on mixed region %pr\n",
+				__func__, devmem->resource);
+		return -ENXIO;
+	}
+	if (is_ram == REGION_INTERSECTS)
+		return -ENXIO;
+
+	devmem->pagemap.flags = MEMORY_DEVICE |
+				MEMORY_DEVICE_ALLOW_MIGRATE |
+				MEMORY_DEVICE_UNADDRESSABLE;
+	devmem->pagemap.res = devmem->resource;
+	devmem->pagemap.page_fault = hmm_devmem_fault;
+	devmem->pagemap.page_free = hmm_devmem_free;
+	devmem->pagemap.dev = devmem->device;
+	devmem->pagemap.ref = &devmem->ref;
+	devmem->pagemap.data = devmem;
+
+	mutex_lock(&hmm_devmem_lock);
+	align_end = align_start + align_size - 1;
+	for (key = align_start; key <= align_end; key += SECTION_SIZE) {
+		struct hmm_devmem *dup;
+
+		rcu_read_lock();
+		dup = hmm_devmem_find(key);
+		rcu_read_unlock();
+		if (dup) {
+			dev_err(device, "%s: collides with mapping for %s\n",
+				__func__, dev_name(dup->device));
+			mutex_unlock(&hmm_devmem_lock);
+			ret = -EBUSY;
+			goto error;
+		}
+		ret = radix_tree_insert(&hmm_devmem_radix,
+					key >> PA_SECTION_SHIFT,
+					devmem);
+		if (ret) {
+			dev_err(device, "%s: failed: %d\n", __func__, ret);
+			mutex_unlock(&hmm_devmem_lock);
+			goto error_radix;
+		}
+	}
+	mutex_unlock(&hmm_devmem_lock);
+
+	nid = dev_to_node(device);
+	if (nid < 0)
+		nid = numa_mem_id();
+
+	ret = track_pfn_remap(NULL, &pgprot, PHYS_PFN(align_start),
+			      0, align_size);
+	if (ret)
+		goto error_radix;
+
+	ret = arch_add_memory(nid, align_start, align_size,
+			      devmem->pagemap.flags);
+	if (ret)
+		goto error_add_memory;
+
+	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer.  It is a bug if a ZONE_DEVICE page is ever
+		 * freed or placed on a driver-private list.  Seed the
+		 * storage with LIST_POISON* values.
+		 */
+		list_del(&page->lru);
+		page->pgmap = &devmem->pagemap;
+	}
+	return 0;
+
+error_add_memory:
+	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
+error_radix:
+	hmm_devmem_radix_release(devmem->resource);
+error:
+	return ret;
+}
+
+static int hmm_devmem_match(struct device *dev, void *data, void *match_data)
+{
+	struct hmm_devmem *devmem = data;
+
+	return devmem->resource == match_data;
+}
+
+static void hmm_devmem_pages_remove(struct hmm_devmem *devmem)
+{
+	devres_release(devmem->device, &hmm_devmem_release,
+		       &hmm_devmem_match, devmem->resource);
+}
+
+/*
+ * hmm_devmem_add() - hotplug fake ZONE_DEVICE memory for device memory
+ *
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: pointer to new hmm_devmem struct, ERR_PTR otherwise
+ *
+ * This first finds an empty range of physical addresses big enough for the new
+ * resource and then hotplugs it as ZONE_DEVICE memory, allocating struct page.
+ * It does not do anything besides that; all events affecting the memory will go
+ * through the various callbacks provided by the hmm_devmem_ops struct.
+ */
+struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
+				  struct device *device,
+				  unsigned long size)
+{
+	struct hmm_devmem *devmem;
+	resource_size_t addr;
+	int ret;
+
+	devmem = devres_alloc_node(&hmm_devmem_release, sizeof(*devmem),
+				   GFP_KERNEL, dev_to_node(device));
+	if (!devmem)
+		return ERR_PTR(-ENOMEM);
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
+			      0, GFP_KERNEL);
+	if (ret)
+		goto error_percpu_ref;
+
+	ret = devm_add_action(device, hmm_devmem_ref_exit, &devmem->ref);
+	if (ret)
+		goto error_devm_add_action;
+
+	size = ALIGN(size, SECTION_SIZE);
+	addr = (iomem_resource.end + 1ULL) - size;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size && addr >= iomem_resource.start; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error_no_resource;
+		}
+		devmem->resource->desc = IORES_DESC_UNADDRESSABLE_MEMORY;
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error_no_resource;
+	}
+
+	devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
+	devmem->pfn_last = devmem->pfn_first +
+			   (resource_size(devmem->resource) >> PAGE_SHIFT);
+
+	ret = hmm_devmem_pages_create(devmem);
+	if (ret)
+		goto error_pages;
+
+	devres_add(device, devmem);
+
+	ret = devm_add_action(device, hmm_devmem_ref_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_remove(devmem);
+		return ERR_PTR(ret);
+	}
+
+	return devmem;
+
+error_pages:
+	devm_release_mem_region(device, devmem->resource->start,
+				resource_size(devmem->resource));
+error_no_resource:
+error_devm_add_action:
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+error_percpu_ref:
+	devres_free(devmem);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ *
+ * This will hot remove memory that was hotplugged by hmm_devmem_add on behalf
+ * of the device driver. It will free struct pages and remove the resource that
+ * reserves the physical address range for this device memory.
+ *
+ * This function returns nothing: it blocks in hmm_devmem_ref_kill() until all
+ * references on the device memory have been dropped. Note that if there is a
+ * refcount bug this might never happen !
+ */
+void hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	resource_size_t start, size;
+	struct device *device;
+
+	if (!devmem)
+		return;
+
+	device = devmem->device;
+	start = devmem->resource->start;
+	size = resource_size(devmem->resource);
+
+	hmm_devmem_ref_kill(&devmem->ref);
+	hmm_devmem_ref_exit(&devmem->ref);
+	hmm_devmem_pages_remove(devmem);
+
+	devm_release_mem_region(device, start, size);
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callbacks for allocating destination memory and copying
+ * @mentry: maximum number of entries in the src or dst array
+ * @src: array of unsigned long containing source pfns
+ * @dst: array of unsigned long containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around migrate_vma() which checks the migration status
+ * for a given fault address and returns the corresponding page fault handler
+ * status, i.e. 0 on success or VM_FAULT_SIGBUS if migration failed for it.
+ *
+ * This is a helper intended to be used by a ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct migrate_vma_ops *ops,
+			   unsigned long mentry,
+			   unsigned long *src,
+			   unsigned long *dst,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private)
+{
+	unsigned long i, size, tmp;
+	if (migrate_vma(ops, vma, mentry, start, end, src, dst, private))
+		return VM_FAULT_SIGBUS;
+
+	for (i = 0, tmp = start; tmp < addr; i++, tmp += size) {
+		size = migrate_pfn_size(src[i]);
+	}
+	if (dst[i] & MIGRATE_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v17 14/14] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 13/14] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE Jérôme Glisse
@ 2017-01-27 22:52 ` Jérôme Glisse
  2017-01-28  5:39 ` [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 John Hubbard
  2017-02-22  7:19 ` Balbir Singh
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2017-01-27 22:52 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans,
	Jérôme Glisse, Evgeny Baskakov, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so that device drivers can use it
to create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage the memory of multiple
physical devices under the same struct device umbrella.
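
For illustration only (not part of the patch): the dummy device class hands
each hmm_device a minor number from a fixed pool of HMM_DEVICE_MAX slots
tracked in a bitmap. The user-space sketch below models that bookkeeping
under simplifying assumptions: no locking, a plain array instead of the
kernel bitmap helpers, and the hypothetical names minor_alloc() and
minor_free(), which do not exist in the kernel.

```c
#include <limits.h>
#include <stddef.h>

#define HMM_DEVICE_MAX 256	/* same pool size as the patch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((HMM_DEVICE_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per minor number; a set bit means the minor is in use. */
static unsigned long minor_mask[MASK_WORDS];

/*
 * Hypothetical helper: claim the lowest free minor.
 * Returns the minor, or -1 when all HMM_DEVICE_MAX slots are taken.
 */
static int minor_alloc(void)
{
	for (unsigned int minor = 0; minor < HMM_DEVICE_MAX; minor++) {
		unsigned long *word = &minor_mask[minor / BITS_PER_WORD];
		unsigned long bit = 1UL << (minor % BITS_PER_WORD);

		if (!(*word & bit)) {
			*word |= bit;
			return (int)minor;
		}
	}
	return -1;
}

/* Hypothetical helper: give a minor back to the pool. */
static void minor_free(unsigned int minor)
{
	unsigned long bit;

	if (minor >= HMM_DEVICE_MAX)
		return;
	bit = 1UL << (minor % BITS_PER_WORD);
	minor_mask[minor / BITS_PER_WORD] &= ~bit;
}
```

In the patch itself the same search is done with find_first_zero_bit() under
hmm_device_lock, and the minor is returned to the pool from the struct device
release callback rather than by an explicit free call.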

Changed since v1:
  - Improve commit message
  - Add drvdata parameter to set on struct device

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 +++++++++++-
 mm/hmm.c            | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2332f89..5528081 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,11 +76,11 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -429,6 +429,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned		minor;
+};
+
+/*
+ * Device drivers that want to handle the memory of multiple devices through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not needed to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void *drvdata);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 61a8c24..41f11f2 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
@@ -1136,4 +1137,99 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * Device drivers that want to handle the memory of multiple devices through a
+ * single fake device can use hmm_device to do so. This is purely a helper and
+ * it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void *drvdata)
+{
+	struct hmm_device *hmm_device;
+	int ret;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	ret = alloc_chrdev_region(&hmm_device->device.devt,0,1,"hmm_device");
+	if (ret < 0) {
+		kfree(hmm_device);
+		return ERR_PTR(ret);
+	}
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor=find_first_zero_bit(hmm_device_mask,HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	dev_set_drvdata(&hmm_device->device, drvdata);
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+	unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+	class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2017-01-27 22:52 ` [HMM v17 14/14] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2 Jérôme Glisse
@ 2017-01-28  5:39 ` John Hubbard
  2017-02-22  7:19 ` Balbir Singh
  15 siblings, 0 replies; 23+ messages in thread
From: John Hubbard @ 2017-01-28  5:39 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: Naoya Horiguchi, David Nellans, Evgeny Baskakov

On 01/27/2017 02:52 PM, Jérôme Glisse wrote:
> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows device memory to be used transparently inside any process
> without any modification to the process program code. Second it allows
> mirroring a process address space on a device.
>
> Changes since v16:
>   - move HMM unaddressable device memory to its own radix tree and
>     thus find_dev_pagemap() will no longer return HMM dev_pagemap
>   - rename HMM migration helpers (drop the prefix) and make them
>     completely independent of HMM
>
>     Migration can now be used to implement things like multi-threaded
>     copy or to make use of a specific memory allocator for destination
>     memory.

We're about to do our usual testing with this, but there will be a brief pause first (the driver API 
has changed slightly).

thanks
john h

>
> Work is under way to use this feature inside nouveau (the upstream
> open source driver for NVidia GPU) in either the 4.11 or 4.12 timeframe.
> But this patchset has otherwise been tested with the closed source
> driver for NVidia GPU and thus we are confident it works and allows
> using the hardware for seamless interaction between CPU and GPU
> in the common address space of a process.
>
> I also discussed the features with other companies and I am confident
> they can be used on other, as yet unreleased, hardware.
>
> Please consider applying for 4.11
>
>
> Known issues:
>
> Device memory picks some random unused physical address range. Later
> memory hotplug might fail because of this. The intention is to fix this
> in a later patchset by using physical addresses above the platform limit,
> thus making sure that no real memory can be hotplugged at a conflicting
> address.
>
>
> Patchset overview:
>
> The patchset is divided into 3 features that can each be used independently
> of one another. First is changes to ZONE_DEVICE so we can have struct
> page for un-addressable device memory (patches 1-4 and 13-14). Second is
> process address space mirroring (patches 8 to 11); this allows snapshotting
> the CPU page table and keeping the device page table synchronized with the
> CPU one.
>
> Last is a new page migration helper which allows migration of a range of
> virtual addresses using a hardware copy engine (patches 5-7 for the new
> migrate function and 12 for migration of un-addressable memory).
>
>
> Future plan:
>
> In this patchset I restricted myself to a set of core features. What
> is missing:
>   - force read only on CPU for memory duplication and GPU atomic
>   - changes to mmu_notifier for optimization purposes
>   - migration of file backed pages to device memory
>
> I plan to submit a couple more patchsets to implement those features
> once core HMM is upstream.
>
> Git tree:
> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v17
>
>
> Previous patchset posting :
>     v1 http://lwn.net/Articles/597289/
>     v2 https://lkml.org/lkml/2014/6/12/559
>     v3 https://lkml.org/lkml/2014/6/13/633
>     v4 https://lkml.org/lkml/2014/8/29/423
>     v5 https://lkml.org/lkml/2014/11/3/759
>     v6 http://lwn.net/Articles/619737/
>     v7 http://lwn.net/Articles/627316/
>     v8 https://lwn.net/Articles/645515/
>     v9 https://lwn.net/Articles/651553/
>     v10 https://lwn.net/Articles/654430/
>     v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
>     v12 http://www.kernelhub.org/?msg=972982&p=2
>     v13 https://lwn.net/Articles/706856/
>     v14 https://lkml.org/lkml/2016/12/8/344
>     v15 http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1304107.html
>     v16 http://www.spinics.net/lists/linux-mm/msg119814.html
>
> Jérôme Glisse (14):
>   mm/memory/hotplug: convert device bool to int to allow for more flags
>     v2
>   mm/ZONE_DEVICE/free-page: callback when page is freed v2
>   mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
>     memory v3
>   mm/ZONE_DEVICE/x86: add support for un-addressable device memory
>   mm/migrate: add new boolean copy flag to migratepage() callback
>   mm/migrate: new memory migration helper for use with device memory v3
>   mm/migrate: migrate_vma() unmap page from vma while collecting pages
>   mm/hmm: heterogeneous memory management (HMM for short)
>   mm/hmm/mirror: mirror process address space on device with HMM helpers
>   mm/hmm/mirror: helper to snapshot CPU page table
>   mm/hmm/mirror: device page fault handler
>   mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
>   mm/hmm/devmem: device memory hotplug using ZONE_DEVICE
>   mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2
>
>  MAINTAINERS                                |    7 +
>  arch/ia64/mm/init.c                        |   23 +-
>  arch/powerpc/mm/mem.c                      |   22 +-
>  arch/s390/mm/init.c                        |   10 +-
>  arch/sh/mm/init.c                          |   22 +-
>  arch/tile/mm/init.c                        |   10 +-
>  arch/x86/mm/init_32.c                      |   23 +-
>  arch/x86/mm/init_64.c                      |   41 +-
>  drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
>  fs/aio.c                                   |    7 +-
>  fs/btrfs/disk-io.c                         |   11 +-
>  fs/hugetlbfs/inode.c                       |    9 +-
>  fs/nfs/internal.h                          |    5 +-
>  fs/nfs/write.c                             |    9 +-
>  fs/proc/task_mmu.c                         |   10 +-
>  fs/ubifs/file.c                            |    8 +-
>  include/linux/balloon_compaction.h         |    3 +-
>  include/linux/fs.h                         |   13 +-
>  include/linux/hmm.h                        |  464 +++++++++++
>  include/linux/ioport.h                     |    1 +
>  include/linux/memory_hotplug.h             |   31 +-
>  include/linux/memremap.h                   |   39 +-
>  include/linux/migrate.h                    |   83 +-
>  include/linux/mm_types.h                   |    5 +
>  include/linux/swap.h                       |   18 +-
>  include/linux/swapops.h                    |   67 ++
>  kernel/fork.c                              |    2 +
>  kernel/memremap.c                          |   31 +-
>  mm/Kconfig                                 |   38 +
>  mm/Makefile                                |    1 +
>  mm/balloon_compaction.c                    |    2 +-
>  mm/hmm.c                                   | 1235 ++++++++++++++++++++++++++++
>  mm/memory.c                                |   64 +-
>  mm/memory_hotplug.c                        |   14 +-
>  mm/migrate.c                               |  659 ++++++++++++++-
>  mm/mprotect.c                              |   12 +
>  mm/rmap.c                                  |   47 ++
>  mm/zsmalloc.c                              |   12 +-
>  38 files changed, 2986 insertions(+), 80 deletions(-)
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
>
> --
> 2.4.3
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2017-01-28  5:39 ` [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 John Hubbard
@ 2017-02-22  7:19 ` Balbir Singh
  2017-02-22  8:16   ` Andrew Morton
  15 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2017-02-22  7:19 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Naoya Horiguchi,
	David Nellans

On Fri, Jan 27, 2017 at 05:52:07PM -0500, Jérôme Glisse wrote:
> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows device memory to be used transparently inside any process
> without any modification to the process program code. Second it allows
> mirroring a process address space on a device.
> 
> Changes since v16:
>   - move HMM unaddressable device memory to its own radix tree and
>     thus find_dev_pagemap() will no longer return HMM dev_pagemap
>   - rename HMM migration helpers (drop the prefix) and make them
>     completely independent of HMM
> 
>     Migration can now be used to implement things like multi-threaded
>     copy or to make use of a specific memory allocator for destination
>     memory.
> 
> Work is under way to use this feature inside nouveau (the upstream
> open source driver for NVidia GPU) in either the 4.11 or 4.12 timeframe.
> But this patchset has otherwise been tested with the closed source
> driver for NVidia GPU and thus we are confident it works and allows
> using the hardware for seamless interaction between CPU and GPU
> in the common address space of a process.
> 
> I also discussed the features with other companies and I am confident
> they can be used on other, as yet unreleased, hardware.
> 
> Please consider applying for 4.11
>

Andrew, do we expect to get this in 4.11/4.12? Just curious.

Balbir Singh. 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-02-22  7:19 ` Balbir Singh
@ 2017-02-22  8:16   ` Andrew Morton
  2017-02-22  8:27     ` Balbir Singh
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2017-02-22  8:16 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	Naoya Horiguchi, David Nellans

On Wed, 22 Feb 2017 18:19:15 +1100 Balbir Singh <bsingharora@gmail.com> wrote:

> On Fri, Jan 27, 2017 at 05:52:07PM -0500, Jérôme Glisse wrote:
> > Cliff note: HMM offers 2 things (each standing on its own). First
> > it allows device memory to be used transparently inside any process
> > without any modification to the process program code. Second it allows
> > mirroring a process address space on a device.
> > 
> > Changes since v16:
> >   - move HMM unaddressable device memory to its own radix tree and
> >     thus find_dev_pagemap() will no longer return HMM dev_pagemap
> >   - rename HMM migration helpers (drop the prefix) and make them
> >     completely independent of HMM
> > 
> >     Migration can now be used to implement things like multi-threaded
> >     copy or to make use of a specific memory allocator for destination
> >     memory.
> > 
> > Work is under way to use this feature inside nouveau (the upstream
> > open source driver for NVidia GPU) in either the 4.11 or 4.12 timeframe.
> > But this patchset has otherwise been tested with the closed source
> > driver for NVidia GPU and thus we are confident it works and allows
> > using the hardware for seamless interaction between CPU and GPU
> > in the common address space of a process.
> > 
> > I also discussed the features with other companies and I am confident
> > they can be used on other, as yet unreleased, hardware.
> > 
> > Please consider applying for 4.11
> >
> 
> Andrew, do we expect to get this in 4.11/4.12? Just curious.
> 

I'll be taking a serious look after -rc1.

The lack of reviewed-by, acked-by and tested-by is a concern.  It's
rather odd for a patchset in the 17th revision!  What's up with that?

Have you reviewed or tested the patches?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-02-22  8:16   ` Andrew Morton
@ 2017-02-22  8:27     ` Balbir Singh
  2017-02-22 15:49       ` Reza Arbab
  2017-02-22 23:58       ` John Hubbard
  0 siblings, 2 replies; 23+ messages in thread
From: Balbir Singh @ 2017-02-22  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jérôme Glisse, linux-kernel, linux-mm, John Hubbard,
	Naoya Horiguchi, David Nellans, Aneesh Kumar KV, Reza Arbab,
	Srikar Dronamraju, haren

On Wed, Feb 22, 2017 at 7:16 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 22 Feb 2017 18:19:15 +1100 Balbir Singh <bsingharora@gmail.com> wrote:
>
>> On Fri, Jan 27, 2017 at 05:52:07PM -0500, Jérôme Glisse wrote:
>> > Cliff note: HMM offers 2 things (each standing on its own). First
>> > it allows to use device memory transparently inside any process
>> > without any modifications to process program code. Second it allows
>> > to mirror process address space on a device.
>> >
>> > Change since v16:
>> >   - move HMM unaddressable device memory to its own radix tree and
>> >     thus find_dev_pagemap() will no longer return HMM dev_pagemap
>> >   - rename HMM migration helper (drop the prefix) and make them
>> >     completely independent of HMM
>> >
>> >     Migration can now be use to implement thing like multi-threaded
>> >     copy or make use of specific memory allocator for destination
>> >     memory.
>> >
>> > Work is under way to use this feature inside nouveau (the upstream
>> > open source driver for NVidia GPU) either 4.11 or 4.12 timeframe.
>> > But this patchset have been otherwise tested with the close source
>> > driver for NVidia GPU and thus we are confident it works and allow
>> > to use the hardware for seamless interaction between CPU and GPU
>> > in common address space of a process.
>> >
>> > I also discussed the features with other company and i am confident
>> > it can be use on other, yet, unrelease hardware.
>> >
>> > Please condiser applying for 4.11
>> >
>>
>> Andrew, do we expect to get this in 4.11/4.12? Just curious.
>>
>
> I'll be taking a serious look after -rc1.
>
> The lack of reviewed-by, acked-by and tested-by is a concern.  It's
> rather odd for a patchset in the 17th revision!  What's up with that?
>
> Have you reviewed or tested the patches?

I reviewed v14/15 of the patches. Aneesh reviewed some versions as
well. I know a few people who tested a small subset of the patches,
I'll get them to report back as well. I think John Hubbard has been
testing iterations as well. CC'ing other interested people as well

Balbir


* Re: [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3
  2017-01-27 22:52 ` [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3 Jérôme Glisse
@ 2017-02-22 15:31   ` Reza Arbab
  0 siblings, 0 replies; 23+ messages in thread
From: Reza Arbab @ 2017-02-22 15:31 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Naoya Horiguchi,
	David Nellans, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti

On Fri, Jan 27, 2017 at 05:52:13PM -0500, Jérôme Glisse wrote:
>This patch add a new memory migration helpers, which migrate memory
>backing a range of virtual address of a process to different memory
>(which can be allocated through special allocator). It differs from
>numa migration by working on a range of virtual address and thus by
>doing migration in chunk that can be large enough to use DMA engine
>or special copy offloading engine.

Just wanted to say I've found these migration helpers quite useful. I've 
been prototyping some driver code which uses them, rebasing on each HMM 
revision since v14. So for what it's worth, 

Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>

-- 
Reza Arbab


* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-02-22  8:27     ` Balbir Singh
@ 2017-02-22 15:49       ` Reza Arbab
  2017-02-22 23:58       ` John Hubbard
  1 sibling, 0 replies; 23+ messages in thread
From: Reza Arbab @ 2017-02-22 15:49 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, Jérôme Glisse, linux-kernel, linux-mm,
	John Hubbard, Naoya Horiguchi, David Nellans, Aneesh Kumar KV,
	Srikar Dronamraju, haren

On Wed, Feb 22, 2017 at 07:27:10PM +1100, Balbir Singh wrote:
>On Wed, Feb 22, 2017 at 7:16 PM, Andrew Morton
><akpm@linux-foundation.org> wrote:
>> On Wed, 22 Feb 2017 18:19:15 +1100 Balbir Singh <bsingharora@gmail.com> wrote:
>>> Andrew, do we expect to get this in 4.11/4.12? Just curious.
>>>
>>
>> I'll be taking a serious look after -rc1.
>>
>> The lack of reviewed-by, acked-by and tested-by is a concern.  It's
>> rather odd for a patchset in the 17th revision!  What's up with that?
>>
>> Have you reviewed or tested the patches?
>
>I reviewed v14/15 of the patches. Aneesh reviewed some versions as
>well. I know a few people who tested a small subset of the patches,
>I'll get them to report back as well. I think John Hubbard has been
>testing iterations as well. CC'ing other interested people as well

I've been testing the migration helper subset in each version since v14.  
Apologies for not having chimed in.

Just sent a Tested-by for that part of v17.

-- 
Reza Arbab


* Re: [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17
  2017-02-22  8:27     ` Balbir Singh
  2017-02-22 15:49       ` Reza Arbab
@ 2017-02-22 23:58       ` John Hubbard
  1 sibling, 0 replies; 23+ messages in thread
From: John Hubbard @ 2017-02-22 23:58 UTC (permalink / raw)
  To: Balbir Singh, Andrew Morton
  Cc: Jérôme Glisse, linux-kernel, linux-mm, Naoya Horiguchi,
	David Nellans, Aneesh Kumar KV, Reza Arbab, Srikar Dronamraju,
	haren, Evgeny Baskakov

On 02/22/2017 12:27 AM, Balbir Singh wrote:
> On Wed, Feb 22, 2017 at 7:16 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Wed, 22 Feb 2017 18:19:15 +1100 Balbir Singh <bsingharora@gmail.com> wrote:
>>>
>>> Andrew, do we expect to get this in 4.11/4.12? Just curious.
>>>
>>
>> I'll be taking a serious look after -rc1.
>>
>> The lack of reviewed-by, acked-by and tested-by is a concern.  It's
>> rather odd for a patchset in the 17th revision!  What's up with that?
>>
>> Have you reviewed or tested the patches?
>
> I reviewed v14/15 of the patches. Aneesh reviewed some versions as
> well. I know a few people who tested a small subset of the patches,
> I'll get them to report back as well. I think John Hubbard has been
> testing iterations as well. CC'ing other interested people as well
>
> Balbir
>

Yes, Evgeny Baskakov and I have been testing each of the posted versions. We are using both 
migration and mirroring, and have a small set of multi-threaded and multi-device tests. I've been 
procrastinating about writing up a summary of the test results, partly because the patchset is still 
changing (bug fixes, new features, API changes) and so we keep resetting our testing.

We (ahem, actually Evgeny has done most of the work) have been debugging and proposing fixes 
directly to Jerome, and that email traffic with Jerome has not been CC-ing this list, so things have 
looked a little quieter than they really were.

Anyway, a very rudimentary testing report:

1. What we are testing: Our latest testing (in the last few weeks) has been against Jerome's repo, here:
	git://people.freedesktop.org/~glisse/linux (branch: hmm-next)

which has moved ahead from his hmm-v17 branch. hmm-next adds a few bug fixes, and a new feature 
(populating CPU pages on a GPU fault). Here are the differences in summary:

$ git diff --stat hmm-v17 hmm-next
  drivers/char/Kconfig             |   10 +
  drivers/char/Makefile            |    1 +
  drivers/char/hmm_dmirror.c       | 1168 +++++++++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/migrate.h          |    8 +-
  include/uapi/linux/hmm_dmirror.h |   54 +++
  mm/hmm.c                         |    6 +-
  mm/migrate.c                     |  174 ++++++--
  7 files changed, 1388 insertions(+), 33 deletions(-)


2. API: As for the driver-kernel API: this is looking OK, although of course the documentation can 
be improved. As Jerome already explained, there are missing pieces of functionality[1] that will be 
added later, and this may change the API, but for now, OK. With this initial API, we can handle both 
"device" and CPU page faults, and migrate pages around.

3. More testing plans: TODO: there are a lot of programs that can be easily modified, to use malloc 
instead of a special device-centric allocator. On our list.

4. Stability: still a little shaky, as we have some pretty recent bug fixes to try out.

5. Performance: I'll send out another note for that at some point. There was a performance bug that 
Jerome just recently fixed, and I want to see how it looks with that fix applied. No real surprises 
though.

6. Code reviews: the large size of the patchset, plus the requirement for a complicated driver to 
exercise it, makes it less likely for other people to review this patch series. It's a bit 
chicken-and-eggy, too, because our UVM driver can't be checked in and shipped until the kernel API 
stabilizes. heh.

-----

[1] For example, due to lacking file-backed memory support, some userspace program variables that 
are file-backed (initialized globals, etc) have to be mapped (from the device) instead of migrated 
to the device, on a device fault.

thanks,
john h


* RE: [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2017-01-27 22:52 ` [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2017-03-13 22:10   ` Krishna Reddy
  0 siblings, 0 replies; 23+ messages in thread
From: Krishna Reddy @ 2017-03-13 22:10 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Naoya Horiguchi, David Nellans, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* update() - update virtual address range of memory
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update: update's type (turn read only, unmap, ...)
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
.......
+	 */
+	void (*update)(struct hmm_mirror *mirror,
+		       enum hmm_update action,
+		       unsigned long start,
+		       unsigned long end);
+};

Minor arg documentation issue: the comment for update() documents @update, but the
parameter is named action, so @update should be @action.
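
To make the suggestion concrete, here is a minimal userspace sketch of the
comment with the parameter name corrected. It is not the kernel code: struct
hmm_mirror is a stand-in, the two enum hmm_update values are assumptions (the
patch only says "turn read only, unmap, ..."), and demo_update() and
demo_invalidate() are hypothetical helpers that exist only to exercise the
callback.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in so the sketch builds outside the kernel (assumption). */
struct hmm_mirror {
	const char *name;
};

/* Assumed values; the patch only says "turn read only, unmap, ...". */
enum hmm_update {
	HMM_UPDATE_READ_ONLY,
	HMM_UPDATE_UNMAP,
};

/*
 * struct hmm_mirror_ops - HMM mirror device operations callback
 *
 * @update: callback to update range on a device
 */
struct hmm_mirror_ops {
	/* update() - update virtual address range of memory
	 *
	 * @mirror: pointer to struct hmm_mirror
	 * @action: update's type (turn read only, unmap, ...)
	 * @start: virtual start address of the range to update
	 * @end: virtual end address of the range to update
	 */
	void (*update)(struct hmm_mirror *mirror,
		       enum hmm_update action,
		       unsigned long start,
		       unsigned long end);
};

/* Last request seen by the hypothetical driver callback below. */
static enum hmm_update last_action;
static unsigned long last_start, last_end;

/* Hypothetical driver callback: records and prints the request. */
static void demo_update(struct hmm_mirror *mirror, enum hmm_update action,
			unsigned long start, unsigned long end)
{
	assert(start < end);
	last_action = action;
	last_start = start;
	last_end = end;
	printf("%s: action=%d range=[0x%lx, 0x%lx)\n",
	       mirror->name, (int)action, start, end);
}

static const struct hmm_mirror_ops demo_ops = {
	.update = demo_update,
};

/* Hypothetical caller: asks the device to unmap [start, end). */
static int demo_invalidate(unsigned long start, unsigned long end)
{
	struct hmm_mirror mirror = { .name = "demo-mirror" };

	demo_ops.update(&mirror, HMM_UPDATE_UNMAP, start, end);
	return last_action == HMM_UPDATE_UNMAP &&
	       last_start == start && last_end == end;
}
```

Calling demo_invalidate(0x1000, 0x3000) goes through the ops table the same
way a mirror user would, with the second argument matching the @action entry
in the comment.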


Thread overview: 23+ messages
2017-01-27 22:52 [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 01/14] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 02/14] mm/ZONE_DEVICE/free-page: callback when page is freed v2 Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 03/14] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v3 Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 04/14] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 05/14] mm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 06/14] mm/migrate: new memory migration helper for use with device memory v3 Jérôme Glisse
2017-02-22 15:31   ` Reza Arbab
2017-01-27 22:52 ` [HMM v17 07/14] mm/migrate: migrate_vma() unmap page from vma while collecting pages Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 08/14] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 09/14] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2017-03-13 22:10   ` Krishna Reddy
2017-01-27 22:52 ` [HMM v17 10/14] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 11/14] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 12/14] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 13/14] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE Jérôme Glisse
2017-01-27 22:52 ` [HMM v17 14/14] mm/hmm/devmem: dummy HMM device for ZONE_DEVICE memory v2 Jérôme Glisse
2017-01-28  5:39 ` [HMM v17 00/14] HMM (Heterogeneous Memory Management) v17 John Hubbard
2017-02-22  7:19 ` Balbir Singh
2017-02-22  8:16   ` Andrew Morton
2017-02-22  8:27     ` Balbir Singh
2017-02-22 15:49       ` Reza Arbab
2017-02-22 23:58       ` John Hubbard
