* [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
@ 2016-11-18 18:18 Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
                   ` (19 more replies)
  0 siblings, 20 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Cliff note: HMM offers two things, each standing on its own. First, it
allows device memory to be used transparently inside any process without
any modification to the program code. Second, it allows a process address
space to be mirrored on a device.

The change since v12 is the use of struct page for device memory even when
the device memory is not accessible by the CPU (because of limitations
imposed by the bus between the CPU and the device).

Using struct page means that there are minimal changes to core mm code.
HMM builds on top of ZONE_DEVICE to provide struct page and adds new
features to ZONE_DEVICE. The first 7 patches implement those changes.

The rest of the patchset is divided into 3 features that can each be used
independently of one another. First is process address space mirroring
(patches 9 to 13), which allows snapshotting the CPU page table and keeping
the device page table synchronized with the CPU one.
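
The intended usage pattern for a device driver, as documented in the
mirroring patches, is roughly:

    hmm_vma_range_lock(vma, start, end);
    /* snapshot the CPU page table for the range */
    /* update the device page table from that snapshot */
    hmm_vma_range_unlock(vma, start, end);

Any CPU page table update that conflicts with a locked range waits until
the range is unlocked, so the two page tables can not disagree.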

Second is a new memory migration helper which allows migrating a range of
virtual addresses of a process. This migration helper also allows a device
to use its own DMA engine to perform the copy between source and
destination memory. This can be useful in many use cases even outside the
HMM context.
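
A rough sketch of the call as documented in the patches (the ops structure
and its callback implementations are driver-provided; my_migrate_ops is a
placeholder name):

    /*
     * alloc_and_copy(): allocate destination memory and copy from the
     * source, possibly using the device DMA engine.
     * finalize_and_map(): learn which pages were actually migrated and
     * which were not, so failures can be handled gracefully.
     */
    hmm_vma_migrate(vma, start, end, &my_migrate_ops);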

The third part of the patchset (patches 17-18) is a set of helpers to
register a ZONE_DEVICE node and manage it. It is meant as a convenience so
that device drivers do not each have to reimplement the same boilerplate
code over and over.


I am hoping that this can now be considered for inclusion upstream. The
bottom line is that without HMM we can not support some of the new hardware
features on x86 PCIE. I do believe we need some solution to support those
features or we won't be able to use such hardware with standards like
C++17, OpenCL 3.0 and others.

I have been working with NVidia to bring up this feature on their Pascal
GPUs. There is real hardware that you can buy today that could benefit from
HMM. We also intend to leverage this inside the open source nouveau driver.


In this patchset I restricted myself to a set of core features. What is
missing:
  - forcing read-only on the CPU for memory duplication and GPU atomics
  - changes to mmu_notifier for optimization purposes
  - migration of file-backed pages to device memory

I plan to submit a couple more patchsets to implement those features once
core HMM is upstream.


Is there anything blocking HMM inclusion? Something fundamental?


Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2

Cheers,
Jérôme

Jérôme Glisse (18):
  mm/memory/hotplug: convert device parameter bool to set of flags
  mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
    memory
  mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
  mm/ZONE_DEVICE/free-page: callback when page is freed
  mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device
    memory
  mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  mm/hmm: heterogeneous memory management (HMM for short)
  mm/hmm/mirror: mirror process address space on device with HMM helpers
  mm/hmm/mirror: add range lock helper, prevent CPU page table update
    for the range
  mm/hmm/mirror: add range monitor helper, to monitor CPU page table
    update
  mm/hmm/mirror: helper to snapshot CPU page table
  mm/hmm/mirror: device page fault handler
  mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  mm/hmm/migrate: new memory migration helper for use with device memory
  mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory

 MAINTAINERS                                |    7 +
 arch/ia64/mm/init.c                        |   19 +-
 arch/powerpc/mm/mem.c                      |   18 +-
 arch/s390/mm/init.c                        |   10 +-
 arch/sh/mm/init.c                          |   18 +-
 arch/tile/mm/init.c                        |   10 +-
 arch/x86/mm/init_32.c                      |   19 +-
 arch/x86/mm/init_64.c                      |   23 +-
 drivers/dax/pmem.c                         |    3 +-
 drivers/nvdimm/pmem.c                      |    5 +-
 drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
 fs/aio.c                                   |    7 +-
 fs/btrfs/disk-io.c                         |   11 +-
 fs/hugetlbfs/inode.c                       |    9 +-
 fs/nfs/internal.h                          |    5 +-
 fs/nfs/write.c                             |    9 +-
 fs/proc/task_mmu.c                         |   10 +-
 fs/ubifs/file.c                            |    8 +-
 include/linux/balloon_compaction.h         |    3 +-
 include/linux/fs.h                         |   13 +-
 include/linux/hmm.h                        |  516 ++++++++++++
 include/linux/memory_hotplug.h             |   17 +-
 include/linux/memremap.h                   |   39 +-
 include/linux/migrate.h                    |    7 +-
 include/linux/mm_types.h                   |    5 +
 include/linux/swap.h                       |   18 +-
 include/linux/swapops.h                    |   67 ++
 kernel/fork.c                              |    2 +
 kernel/memremap.c                          |   48 +-
 mm/Kconfig                                 |   23 +
 mm/Makefile                                |    1 +
 mm/balloon_compaction.c                    |    2 +-
 mm/hmm.c                                   | 1175 ++++++++++++++++++++++++++++
 mm/memory.c                                |   33 +
 mm/memory_hotplug.c                        |    4 +-
 mm/migrate.c                               |  651 ++++++++++++++-
 mm/mprotect.c                              |   12 +
 mm/page_alloc.c                            |   10 +
 mm/rmap.c                                  |   47 ++
 tools/testing/nvdimm/test/iomap.c          |    2 +-
 40 files changed, 2811 insertions(+), 83 deletions(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.4.3


* [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  0:44   ` Balbir Singh
  2016-11-21  6:41   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
                   ` (18 subsequent siblings)
  19 siblings, 2 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

This is only useful for architectures where we support ZONE_DEVICE and
where we also want to support un-addressable device memory. We need struct
page for such un-addressable memory. But we should avoid populating the
kernel linear mapping for the physical address range, because there is no
real memory, or anything else, behind those physical addresses.

Hence we need more flags than just knowing whether it is device memory or not.
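
As an illustration of the new calling convention (a sketch; nid/start/size
are placeholders, the flag values are defined in the hunk below):

    /* regular memory hotplug, previously for_device == false */
    arch_add_memory(nid, start, size, MEMORY_FLAGS_NONE);

    /* ZONE_DEVICE memory addressable by the CPU, previously for_device == true */
    arch_add_memory(nid, start, size, MEMORY_DEVICE);

    /* device memory the CPU can not access, new with this series */
    arch_add_memory(nid, start, size, MEMORY_DEVICE | MEMORY_UNADDRESSABLE);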

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/ia64/mm/init.c            | 19 ++++++++++++++++---
 arch/powerpc/mm/mem.c          | 18 +++++++++++++++---
 arch/s390/mm/init.c            | 10 ++++++++--
 arch/sh/mm/init.c              | 18 +++++++++++++++---
 arch/tile/mm/init.c            | 10 ++++++++--
 arch/x86/mm/init_32.c          | 19 ++++++++++++++++---
 arch/x86/mm/init_64.c          | 19 ++++++++++++++++---
 include/linux/memory_hotplug.h | 17 +++++++++++++++--
 kernel/memremap.c              |  4 ++--
 mm/memory_hotplug.c            |  4 ++--
 10 files changed, 113 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1841ef6..95a2fa5 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,7 +645,7 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	pg_data_t *pgdat;
 	struct zone *zone;
@@ -653,10 +653,17 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 
 	if (ret)
@@ -667,13 +674,19 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..e3c0532 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,7 +126,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata;
 	struct zone *zone;
@@ -134,6 +134,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int rc;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
@@ -147,18 +153,24 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 
 	/* this should work for most non-highmem platforms */
 	zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, 0, for_device);
+		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
+
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
 
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index f56a39b..4147b87 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	unsigned long normal_end_pfn = PFN_DOWN(memblock_end_of_DRAM());
 	unsigned long dma_end_pfn = PFN_DOWN(MAX_DMA_ADDRESS);
@@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	unsigned long nr_pages;
 	int rc, zone_enum;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
@@ -197,7 +203,7 @@ unsigned long memory_block_size_bytes(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..f72a402 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,19 +485,25 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
 	ret = __add_pages(nid, pgdat->node_zones +
 			zone_for_memory(nid, start, size, ZONE_NORMAL,
-			for_device),
+					flags & MEMORY_DEVICE),
 			start_pfn, nr_pages);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
@@ -516,13 +522,19 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (unlikely(ret))
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index adce254..5fd972c 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,13 +863,19 @@ void __init mem_init(void)
  * memory to the highmem for now.
  */
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
+int arch_add_memory(u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata = &contig_page_data;
 	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(zone, start_pfn, nr_pages);
 }
 
@@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/* TODO */
 	return -EBUSY;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cf80590..16a9095 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,24 +816,37 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata = NODE_DATA(nid);
 	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
+		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	return __remove_pages(zone, start_pfn, nr_pages);
 }
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 14b9dd7..8c4abb0 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -651,15 +651,22 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	struct zone *zone;
 	int ret;
 
+	/* Need to add support for device and unaddressable memory if needed */
+	if (flags & MEMORY_UNADDRESSABLE) {
+		BUG();
+		return -EINVAL;
+	}
+
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 01033fa..ba9b12e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,7 +103,7 @@ extern bool memhp_auto_online;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, int flags);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+
+/*
+ * For device memory we want more information than just knowing it is device
+ * memory. We want to know if we can migrate it (ie it is not storage memory
+ * used by DAX). Is it addressable by the CPU? Some device memory like GPU
+ * memory can not be accessed by the CPU but we still want struct page so
+ * that we can use it like regular memory.
+ */
+#define MEMORY_FLAGS_NONE 0
+#define MEMORY_DEVICE (1 << 0)
+#define MEMORY_MOVABLE (1 << 1)
+#define MEMORY_UNADDRESSABLE (1 << 2)
+
+extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..07665eb 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
 	if (error)
 		goto err_add_memory;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9629273..b2942d7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1386,7 +1386,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_FLAGS_NONE);
 
 	if (ret < 0)
 		goto error;
@@ -2205,7 +2205,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_FLAGS_NONE);
 
 	try_offline_node(nid);
 
-- 
2.4.3


* [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  8:06   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

This adds support for un-addressable device memory. Such memory is
hotplugged only so we can have struct page for it, but it should never be
mapped. This patch adds code to the mm page fault path to catch any such
mapping and SIGBUS on such an event.
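
For reference, a sketch of how a caller is expected to hotplug such memory
with the extended devm_memremap_pages() prototype below (dev, res and ref
stand for the driver's device, resource and percpu_ref):

    struct dev_pagemap *pgmap;
    void *addr;

    addr = devm_memremap_pages(dev, &res, &ref, NULL, &pgmap,
                               MEMORY_DEVICE | MEMORY_UNADDRESSABLE);
    if (IS_ERR(addr))
        return PTR_ERR(addr);

    /*
     * pgmap->flags now has MEMORY_UNADDRESSABLE set, so if one of these
     * pages ever ends up mapped, handle_pte_fault() catches it through
     * is_addressable_page() and returns VM_FAULT_SIGBUS.
     */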

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/dax/pmem.c                |  3 ++-
 drivers/nvdimm/pmem.c             |  5 +++--
 include/linux/memremap.h          | 23 ++++++++++++++++++++---
 kernel/memremap.c                 | 12 +++++++++---
 mm/memory.c                       |  9 +++++++++
 tools/testing/nvdimm/test/iomap.c |  2 +-
 6 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 1f01e98..1b42aef 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -107,7 +107,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
+				   altmap, NULL, MEMORY_DEVICE);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 571a6c7..5ffd937 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-				altmap);
+					   altmap, NULL, MEMORY_DEVICE);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -269,7 +269,8 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
-				&q->q_usage_counter, NULL);
+					   &q->q_usage_counter,
+					   NULL, NULL, MEMORY_DEVICE);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..fe61dca 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,22 +41,34 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
  */
 struct dev_pagemap {
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	int flags;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap, int flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return ((page_zonenum(page) != ZONE_DEVICE) ||
+		!(page->pgmap->flags & MEMORY_UNADDRESSABLE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
-		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+					struct resource *res,
+					struct percpu_ref *ref,
+					struct vmem_altmap *altmap,
+					struct dev_pagemap **ppgmap,
+					int flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
@@ -71,6 +83,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return true;
+}
 #endif
 
 /**
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07665eb..438a73aa2 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
+	arch_remove_memory(align_start, align_size, pgmap->flags);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -270,6 +270,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @ppgmap: pointer set to the new dev_pagemap on success
+ * @flags: flag for memory (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -280,7 +282,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *    this is not enforced.
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap, int flags)
 {
 	resource_size_t key, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -322,6 +325,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->flags = flags | MEMORY_DEVICE;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
@@ -358,7 +362,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
 	if (error)
 		goto err_add_memory;
 
@@ -375,6 +379,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		page->pgmap = pgmap;
 	}
 	devres_add(dev, page_map);
+	if (ppgmap)
+		*ppgmap = pgmap;
 	return __va(res->start);
 
  err_add_memory:
diff --git a/mm/memory.c b/mm/memory.c
index 840adc6..15f2908 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -3482,6 +3483,7 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 static int handle_pte_fault(struct fault_env *fe)
 {
 	pte_t entry;
+	struct page *page;
 
 	if (unlikely(pmd_none(*fe->pmd))) {
 		/*
@@ -3533,6 +3535,13 @@ static int handle_pte_fault(struct fault_env *fe)
 	if (pte_protnone(entry) && vma_is_accessible(fe->vma))
 		return do_numa_page(fe, entry);
 
+	/* Catch mapping of un-addressable memory this should never happen */
+	page = pfn_to_page(pte_pfn(entry));
+	if (!is_addressable_page(page)) {
+		print_bad_pte(fe->vma, fe->address, entry, page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
 	spin_lock(fe->ptl);
 	if (unlikely(!pte_same(*fe->pte, entry)))
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index c29f8dc..899d6a8 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -108,7 +108,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
 
 	if (nfit_res)
 		return nfit_res->buf + offset - nfit_res->res->start;
-	return devm_memremap_pages(dev, res, ref, altmap);
+	return devm_memremap_pages(dev, res, ref, altmap, NULL, MEMORY_DEVICE);
 }
 EXPORT_SYMBOL(__wrap_devm_memremap_pages);
 
-- 
2.4.3


* [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  8:18   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

Catch pages from ZONE_DEVICE in free_hot_cold_page(). This should never
happen, as a ZONE_DEVICE page must always have an elevated refcount.

This is to catch refcounting issues in a sane way for ZONE_DEVICE pages.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fbfead..09b2630 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2435,6 +2435,16 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	/*
+	 * This should never happen! A ZONE_DEVICE page must always have an
+	 * active refcount. Complain about it and try to restore the refcount.
+	 */
+	if (is_zone_device_page(page)) {
+		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
+		page_ref_inc(page);
+		return;
+	}
+
 	if (!free_pcp_prepare(page))
 		return;
 
-- 
2.4.3


* [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  1:49   ` Balbir Singh
  2016-11-21  8:26   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

When a ZONE_DEVICE page refcount reaches 1 it means the page is free and
nobody is holding a reference on it (only the device to which the memory
belongs does). Add a callback and call it when that happens so device
drivers can implement their own free page management.
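
A sketch of how a driver is expected to hook this up (my_free_devpage,
my_devmem_free and struct my_device are hypothetical driver-side names):

    static void my_free_devpage(struct page *page, void *data)
    {
        struct my_device *mdev = data;

        /* no user left, hand the page back to the driver's allocator */
        my_devmem_free(mdev, page);
    }

    /* after devm_memremap_pages() returned pgmap: */
    pgmap->free_devpage = my_free_devpage;
    pgmap->data = mdev;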

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h | 4 ++++
 kernel/memremap.c        | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index fe61dca..469c88d 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -37,17 +37,21 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @free_devpage: free page callback when the page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: private data pointer for free_devpage
  * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
  */
 struct dev_pagemap {
+	void (*free_devpage)(struct page *page, void *data);
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
 	int flags;
 };
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 438a73aa2..3d28048 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+	/*
+	 * If refcount is 1 then page is freed and refcount is stable as nobody
+	 * holds a reference on the page.
+	 */
+	if (page->pgmap->free_devpage && page_count(page) == 1)
+		page->pgmap->free_devpage(page, page->pgmap->data);
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
@@ -326,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
 	pgmap->flags = flags | MEMORY_DEVICE;
+	pgmap->free_devpage = NULL;
+	pgmap->data = NULL;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
-- 
2.4.3


* [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21 10:37   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

HMM wants to remove device memory early, before device tear down, so add a
helper to do that.
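
Expected usage at device teardown time, pgmap being what
devm_memremap_pages() handed back through its ppgmap argument:

    /* remove the device memory now, before the device itself goes away */
    devm_memremap_pages_remove(dev, pgmap);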

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h |  7 +++++++
 kernel/memremap.c        | 14 ++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 469c88d..b6f03e9 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -60,6 +60,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 			  struct percpu_ref *ref, struct vmem_altmap *altmap,
 			  struct dev_pagemap **ppgmap, int flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
 static inline bool is_addressable_page(const struct page *page)
 {
@@ -88,6 +89,12 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
+static inline int devm_memremap_pages_remove(struct device *dev,
+					     struct dev_pagemap *pgmap)
+{
+	return -EINVAL;
+}
+
 static inline bool is_addressable_page(const struct page *page)
 {
 	return true;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 3d28048..cf83928 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -401,6 +401,20 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 }
 EXPORT_SYMBOL(devm_memremap_pages);
 
+static int devm_page_map_match(struct device *dev, void *data, void *match_data)
+{
+	struct page_map *page_map = data;
+
+	return &page_map->pgmap == match_data;
+}
+
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap)
+{
+	return devres_release(dev, &devm_memremap_pages_release,
+			      &devm_page_map_match, pgmap);
+}
+EXPORT_SYMBOL(devm_memremap_pages_remove);
+
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
 	/* number of pfns from base where pfn_to_page() is valid */
-- 
2.4.3


* [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  2:06   ` Balbir Singh
  2016-11-21 10:58   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
                   ` (13 subsequent siblings)
  19 siblings, 2 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

To allow use of un-addressable device memory inside a process, add a
special swap type. Also add a new callback to handle a page fault on such
an entry.
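
To sketch the flow with the helpers added below (dpage, ptep, mm and vma
are placeholders): when memory is migrated to an un-addressable device
page, a device swap entry is installed instead of a present pte, and a
later CPU access is routed back to the driver:

    /* migrating a CPU page to an un-addressable device page dpage */
    swp_entry_t entry = make_device_entry(dpage, vma->vm_flags & VM_WRITE);
    set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

    /*
     * A later CPU access to addr faults, do_swap_page() sees the device
     * entry and calls page->pgmap->fault() through device_entry_fault().
     */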

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/proc/task_mmu.c       | 10 +++++++-
 include/linux/memremap.h |  5 ++++
 include/linux/swap.h     | 18 ++++++++++---
 include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c        | 14 ++++++++++
 mm/Kconfig               | 12 +++++++++
 mm/memory.c              | 24 +++++++++++++++++
 mm/mprotect.c            | 12 +++++++++
 8 files changed, 158 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6909582..0726d39 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
+		} else if (is_migration_entry(swpent)) {
 			page = migration_entry_to_page(swpent);
+		} else if (is_device_entry(swpent)) {
+			page = device_entry_to_page(swpent);
+		}
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -708,6 +711,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1191,6 +1196,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_entry(entry))
+			page = device_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index b6f03e9..d584c74 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
  */
 struct dev_pagemap {
 	void (*free_devpage)(struct page *page, void *data);
+	int (*fault)(struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned flags,
+		     pmd_t *pmdp);
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7e553e1..599cb54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Un-addressable device memory support
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -442,8 +454,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..d1aa425 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index cf83928..0670015 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -200,6 +202,18 @@ void put_zone_device_page(struct page *page)
 }
 EXPORT_SYMBOL(put_zone_device_page);
 
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_entry_to_page(entry);
+
+	return page->pgmap->fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..0a21411 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,18 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_UNADDRESSABLE
+	bool "Un-addressable device memory (GPU memory, ...)"
+	depends on ZONE_DEVICE
+
+	help
+	  Allows the creation of struct page for un-addressable device
+	  memory, ie memory that is only accessible by the device (or a
+	  group of devices).
+
+	  This allows migrating chunks of process memory to device memory
+	  while that memory is in use by the device.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index 15f2908..a83d690 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -889,6 +889,21 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_entry(entry)) {
+			page = device_entry_to_page(entry);
+
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			if (is_write_device_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*src_pte))
+					pte = pte_swp_mksoft_dirty(pte);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1191,6 +1206,12 @@ again:
 
 			page = migration_entry_to_page(entry);
 			rss[mm_counter(page)]--;
+		} else if (is_device_entry(entry)) {
+			struct page *page = device_entry_to_page(entry);
+			rss[mm_counter(page)]--;
+
+			page_remove_rmap(page, false);
+			put_page(page);
 		}
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
@@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
+		} else if (is_device_entry(entry)) {
+			ret = device_entry_fault(vma, fe->address, entry,
+						 fe->flags, fe->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1bc1eb3..70aff3a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_entry(entry)) {
+				pte_t newpte;
+
+				make_device_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
-- 
2.4.3


* [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  2:08   ` Balbir Singh
  2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

It does not need much, just skip populating the kernel linear mapping for
ranges of un-addressable device memory (the range is picked so that no
physical memory resource overlaps it). All the logic is in shared mm code.

Only x86-64 is supported, as this feature does not make much sense with the
constrained virtual address space of 32-bit architectures.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/mm/init_64.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8c4abb0..556f7bb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -661,13 +661,17 @@ int arch_add_memory(int nid, u64 start, u64 size, int flags)
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
-	/* Need to add support for device and unaddressable memory if needed */
-	if (flags & MEMORY_UNADDRESSABLE) {
-		BUG();
-		return -EINVAL;
-	}
-
-	init_memory_mapping(start, start + size);
+	/*
+	 * We get un-addressable memory when someone is adding ZONE_DEVICE
+	 * memory to have struct page for device memory which is not accessible
+	 * by the CPU, so it is pointless to have a kernel linear mapping of
+	 * such memory.
+	 *
+	 * Core mm should make sure it never sets a pte pointing to such a fake
+	 * physical range.
+	 */
+	if (!(flags & MEMORY_UNADDRESSABLE))
+		init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
@@ -972,12 +976,6 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
 	struct zone *zone;
 	int ret;
 
-	/* Need to add support for device and unaddressable memory if needed */
-	if (flags & MEMORY_UNADDRESSABLE) {
-		BUG();
-		return -EINVAL;
-	}
-
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
@@ -985,7 +983,9 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
 	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-	kernel_physical_mapping_remove(start, start + size);
+
+	if (!(flags & MEMORY_UNADDRESSABLE))
+		kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
-- 
2.4.3


* [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  2:29   ` Balbir Singh
  2016-11-23  4:03   ` Anshuman Khandual
  2016-11-18 18:18 ` [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
                   ` (11 subsequent siblings)
  19 siblings, 2 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

HMM provides 3 separate pieces of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by those 3
pieces of functionality.
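
As an illustration of the hmm_pfn_t encoding introduced below (pfns[]
stands for a snapshot array filled by the mirror helpers of later patches):

    struct page *page = hmm_pfn_to_page(pfns[i]); /* NULL unless HMM_PFN_VALID */
    bool writable = pfns[i] & HMM_PFN_WRITE;      /* CPU pte allows writes */

    /* going the other way, when building such an array: */
    pfns[i] = hmm_pfn_from_page(page);            /* sets HMM_PFN_VALID */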

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 MAINTAINERS              |   7 +++
 include/linux/hmm.h      | 139 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |  11 ++++
 mm/Makefile              |   1 +
 mm/hmm.c                 |  86 +++++++++++++++++++++++++++++
 7 files changed, 251 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f593300..41cd63d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5582,6 +5582,13 @@ S:	Supported
 F:	drivers/scsi/hisi_sas/
 F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..54dd529
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,139 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * HMM provides 3 separate pieces of functionality:
+ *   - Mirroring: synchronize CPU page table and device page table
+ *   - Device memory: allocating struct page for device memory
+ *   - Migration: migrating regular memory to device memory
+ *
+ * Each can be used independently from the others.
+ *
+ *
+ * Mirroring:
+ *
+ * HMM provides helpers to mirror a process address space on a device. For
+ * this it provides several helpers to order device page table updates with
+ * respect to CPU page table updates. The requirement is that for any given
+ * virtual address the CPU and device page tables can not point to different
+ * physical pages. It uses the mmu_notifier API and introduces a virtual
+ * address range lock which blocks CPU page table updates for a range while
+ * the device page table is being updated. The usage pattern is:
+ *
+ *      hmm_vma_range_lock(vma, start, end);
+ *      // snap shot CPU page table
+ *      // update device page table from snapshot
+ *      hmm_vma_range_unlock(vma, start, end);
+ *
+ * Any CPU page table update that conflicts with a range lock will wait until
+ * the range is unlocked. This guarantees proper serialization of CPU and
+ * device page table updates.
+ *
+ *
+ * Device memory:
+ *
+ * HMM provides helpers to leverage device memory, whether it is addressable by
+ * the CPU like regular memory or not addressable at all. In both cases the
+ * device memory is associated with dedicated struct pages (which are allocated
+ * as for hotplugged memory). Device memory management is the responsibility of
+ * the device driver; HMM only allocates and initializes the struct pages
+ * associated with the device memory.
+ *
+ * Allocating struct pages for device memory allows using device memory almost
+ * like any regular memory. Unlike regular memory it cannot be added to the
+ * lru, nor can any memory allocation use device memory directly. Device memory
+ * will only end up being used by a process if the device driver migrates some
+ * of the process memory from regular memory to device memory.
+ *
+ *
+ * Migration:
+ *
+ * The existing memory migration mechanism (mm/migrate.c) does not allow using
+ * anything other than the CPU to copy from source to destination memory.
+ * Moreover, the existing code is not tailored to drive migration from a range
+ * of process virtual addresses rather than from a list of pages. Finally, the
+ * migration flow does not allow graceful failure at each step of the migration.
+ *
+ * HMM solves all of the above through a simple API:
+ *
+ *      hmm_vma_migrate(vma, start, end, ops);
+ *
+ * The ops struct provides 2 callbacks: alloc_and_copy() allocates the
+ * destination memory and initializes it from the source memory. Migration can
+ * fail after this step, so the last callback, finalize_and_map(), lets the
+ * device driver know which pages were successfully migrated and which were not.
+ *
+ * This can easily be used outside of the HMM intended use case.
+ *
+ *
+ * This header file contains all the APIs related to these 3 functionalities,
+ * and each function and struct is more thoroughly documented in the comments below.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table have the write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only ! Not to be use by device driver ! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only ! Not to be use by device driver ! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8aced..4effdbf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -516,6 +517,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM need to track few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 690a1aad..af0eec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -702,6 +703,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 0a21411..be18cc2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,17 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config HMM
+	bool "Heterogeneous memory management (HMM)"
+	depends on MMU
+	default n
+	help
+	  Heterogeneous memory management, set of helpers for:
+	    - mirroring of process address space on a device
+	    - using device memory transparently inside a process
+
+	  If unsure, say N to disable HMM.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf..6ac1284 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..342b596
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for informations about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers but by other HMM
+ * components. It allocates an HMM struct if the mm does not already have one
+ * and initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	struct hmm *hmm = NULL;
+
+	if (!mm->hmm) {
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (!mm->hmm)
+		/*
+		 * The hmm struct can only be freed once the mm_struct goes
+		 * away, hence we should always have pre-allocated a new hmm
+		 * struct above.
+		 */
+		mm->hmm = hmm;
+	else if (hmm)
+		kfree(hmm);
+	hmm = mm->hmm;
+	spin_unlock(&mm->page_table_lock);
+
+	return hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	/*
+	 * We should not need to lock here as no one should be able to register
+	 * a new HMM while an mm is being destroyed. But just to be safe ...
+	 */
+	spin_lock(&mm->page_table_lock);
+	hmm = mm->hmm;
+	mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+	if (!hmm)
+		return;
+
+	kfree(hmm);
+}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-21  2:42   ` Balbir Singh
  2016-11-18 18:18 ` [HMM v13 10/18] mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range Jérôme Glisse
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This is heterogeneous memory management (HMM) process address space
mirroring. In a nutshell it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page
tables synchronized (we assume that both device and CPU are cache
coherent, as PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, thus avoiding each device driver having to grow its own
CPU page table walker and its own CPU page table synchronization
mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.
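
As a rough illustration (not part of the patch), binding a driver object to
an mm with this API could look like the sketch below; my_device_as and
my_invalidate_range() are made-up names for the example:

    #include <linux/hmm.h>
    #include <linux/mm_types.h>

    struct my_device_as {
        struct hmm_mirror mirror;
        /* device specific page table state would live here */
    };

    /* Called by HMM whenever the CPU page table changes for [start, end). */
    static void my_update(struct hmm_mirror *mirror, enum hmm_update action,
                          unsigned long start, unsigned long end)
    {
        struct my_device_as *as;

        as = container_of(mirror, struct my_device_as, mirror);
        /* my_invalidate_range() is a made-up driver helper; it must not
         * return before the device view of the range is gone. */
        my_invalidate_range(as, start, end);
    }

    static const struct hmm_mirror_ops my_mirror_ops = {
        .update = my_update,
    };

    static int my_bind(struct my_device_as *as, struct mm_struct *mm)
    {
        as->mirror.ops = &my_mirror_ops;
        return hmm_mirror_register(&as->mirror, mm);
    }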

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  97 +++++++++++++++++++++++++++++++
 mm/hmm.c            | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 257 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 54dd529..f44e270 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,6 +88,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
@@ -127,6 +128,102 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
 }
 
 
+/*
+ * Mirroring: how to synchronize the device page table with the CPU page table?
+ *
+ * Device drivers must always synchronize with CPU page table updates; for this
+ * they can either directly use the mmu_notifier API or they can use the
+ * hmm_mirror API. A device driver can decide to register one mirror per device
+ * per process, or just one mirror per process for a group of devices. Pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *          int ret;
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contain an hmm_mirror struct as one of its field.
+ *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *          // Other device driver specific initialization
+ *      }
+ *
+ * The device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(); expected usage is to do that when the
+ * device driver is unbinding from an address space.
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          hmm_mirror_unregister(&das->mirror);
+ *          // Other device driver specific cleanup and now das can be free
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver will
+ * get callbacks through the update() operation (see the hmm_mirror_ops struct).
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* update() - update virtual address range of memory
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update: update's type (turn read only, unmap, ...)
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback is called when the CPU page table is updated; the device
+	 * driver must update the device page table according to the update action.
+	 *
+	 * The device driver callback must wait until the device has fully updated
+	 * its view of the range. Note we plan to make this asynchronous in later
+	 * patches, so that multiple devices can schedule updates to their page
+	 * tables and, once all devices have scheduled the update, we wait for
+	 * them to propagate.
+	 */
+	void (*update)(struct hmm_mirror *mirror,
+		       enum hmm_update action,
+		       unsigned long start,
+		       unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register an
+ * hmm_mirror struct with HMM. HMM will track the list of all mirrors for each
+ * mm_struct (or each process).
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 342b596..3594785 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting mirrors list
+ * @mirrors: list of mirrors for this mm
+ * @wait_queue: wait queue
+ * @sequence: we track update to CPU page table with a sequence number
+ * @mmu_notifier: mmu notifier to track update to CPU page table
+ * @notifier_count: number of currently active notifier count
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	atomic_t		sequence;
+	wait_queue_head_t	wait_queue;
+	struct mmu_notifier	mmu_notifier;
+	atomic_t		notifier_count;
 };
 
 /*
@@ -48,6 +61,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 		if (!hmm)
 			return NULL;
+		init_waitqueue_head(&hmm->wait_queue);
+		atomic_set(&hmm->notifier_count, 0);
+		INIT_LIST_HEAD(&hmm->mirrors);
+		atomic_set(&hmm->sequence, 0);
+		hmm->mmu_notifier.ops = NULL;
+		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 	}
 
@@ -84,3 +103,144 @@ void hmm_mm_destroy(struct mm_struct *mm)
 
 	kfree(hmm);
 }
+
+
+
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	/*
+	 * A mirror being added or removed is a rare event, so list traversal
+	 * isn't protected by a lock; we rely on simple rules. All list
+	 * modifications are done using list_add_rcu() and list_del_rcu() under
+	 * a spinlock to protect from concurrent addition or removal but not
+	 * traversal.
+	 *
+	 * Because hmm_mirror_unregister() waits for all running invalidations
+	 * to complete (and thus for all list traversals to finish), none of the
+	 * mirror structs can be freed from under us while traversing the list,
+	 * and thus it is safe to dereference their list pointers even if they were just removed.
+	 */
+	list_for_each_entry (mirror, &hmm->mirrors, list)
+		mirror->ops->update(mirror, action, start, end);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	/* Reverse order here because we are getting out of invalidation */
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space device driver must register an
+ * HMM mirror struct.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	/* Register mmu_notifier if not already, use mmap_sem for locking */
+	if (!mirror->hmm->mmu_notifier.ops) {
+		struct hmm *hmm = mirror->hmm;
+		down_write(&mm->mmap_sem);
+		if (!hmm->mmu_notifier.ops) {
+			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+				hmm->mmu_notifier.ops = NULL;
+				up_write(&mm->mmap_sem);
+				return -ENOMEM;
+			}
+		}
+		up_write(&mm->mmap_sem);
+	}
+
+	spin_lock(&mirror->hmm->lock);
+	list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
+	spin_unlock(&mirror->hmm->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&mirror->list);
+	spin_unlock(&hmm->lock);
+
+	/*
+	 * Wait for all active notifiers so that it is safe to traverse the
+	 * mirror list without any lock.
+	 */
+	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 10/18] mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 11/18] mm/hmm/mirror: add range monitor helper, to monitor CPU page table update Jérôme Glisse
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

There are two possible strategies when it comes to snapshotting the CPU page
table into the device page table. The first one snapshots the CPU page table
and keeps track of active mmu_notifier callbacks. Once the snapshot is done,
and before updating the device page table (in an atomic fashion), it checks
the mmu_notifier sequence. If the sequence is the same as at the time the CPU
page table was snapshot, then no mmu_notifier ran in the meantime and the
snapshot is accurate. If the sequence is different, then an mmu_notifier
callback did run, the snapshot might no longer be valid, and the whole
procedure must be restarted.

The issue with this approach is that it does not guarantee forward progress
for the device driver trying to mirror a range of the address space.

The second solution, implemented by this patch, is to serialize CPU snapshots
with mmu_notifier callbacks and have each wait on the other according to the
order in which they happen. This guarantees forward progress for the driver.
The drawback is that it can stall a process waiting on the mmu_notifier
callback to finish. So things like direct page reclaim (or even indirect
reclaim) might stall, and this might increase overall kernel latency.

For now just accept this potential issue and wait for a real world workload
to be affected by it before trying to fix it. The fix is probably to introduce
a new mmu_notifier_try_to_invalidate() that could return failure if it has to
wait or sleep, and to use it inside the reclaim code to decide to skip to the
next reclaim candidate.
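
For illustration only (not part of the patch), the blocking pattern this
enables looks roughly like the sketch below; device_update_page_table() is
a made-up driver helper and the caller is assumed to hold mmap_sem for read:

    #include <linux/hmm.h>
    #include <linux/mm.h>

    static int my_snapshot_and_commit(struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end)
    {
        struct hmm_range range;
        int ret;

        /* Blocks CPU page table invalidation for [start, end). */
        ret = hmm_vma_range_lock(&range, vma, start, end);
        if (ret)
            return ret;

        /* device_update_page_table() is hypothetical: it would snapshot the
         * CPU page table and update the device page table. Keep this window
         * as short as possible, CPU invalidation is blocked meanwhile. */
        ret = device_update_page_table(vma, start, end);

        hmm_vma_range_unlock(&range);
        return ret;
    }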

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  30 ++++++++++++
 mm/hmm.c            | 131 +++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f44e270..c0b1c07 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -224,6 +224,36 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
 
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @hmm: core hmm struct this range is active against
+ * @list: all range lock are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @waiting: pointer to range waiting on this one
+ * @wakeup: use to wakeup the range when it was waiting
+ */
+struct hmm_range {
+	struct hmm		*hmm;
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	struct hmm_range	*waiting;
+	bool			wakeup;
+};
+
+/*
+ * Range locking allows guaranteeing forward progress by blocking CPU page
+ * table invalidation. See function descriptions in mm/hmm.c for documentation.
+ */
+int hmm_vma_range_lock(struct hmm_range *range,
+		       struct vm_area_struct *vma,
+		       unsigned long start,
+		       unsigned long end);
+void hmm_vma_range_unlock(struct hmm_range *range);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 3594785..ee05419 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -27,7 +27,8 @@
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
- * @lock: lock protecting mirrors list
+ * @lock: lock protecting mirrors and ranges list
+ * @ranges: list of range lock (for snapshot and invalidation serialization)
  * @mirrors: list of mirrors for this mm
  * @wait_queue: wait queue
  * @sequence: we track update to CPU page table with a sequence number
@@ -37,6 +38,7 @@
 struct hmm {
 	struct mm_struct	*mm;
 	spinlock_t		lock;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	atomic_t		sequence;
 	wait_queue_head_t	wait_queue;
@@ -66,6 +68,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		INIT_LIST_HEAD(&hmm->mirrors);
 		atomic_set(&hmm->sequence, 0);
 		hmm->mmu_notifier.ops = NULL;
+		INIT_LIST_HEAD(&hmm->ranges);
 		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 	}
@@ -104,16 +107,48 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	kfree(hmm);
 }
 
-
-
 static void hmm_invalidate_range(struct hmm *hmm,
 				 enum hmm_update action,
 				 unsigned long start,
 				 unsigned long end)
 {
+	struct hmm_range range, *tmp;
 	struct hmm_mirror *mirror;
 
 	/*
+	 * Serialize invalidation with CPU snapshot (see hmm_vma_range_lock()).
+	 * We need to change mmu_notifier so that we can get a struct that
+	 * stays alive across calls to mmu_notifier_invalidate_range_start() and
+	 * mmu_notifier_invalidate_range_end(). FIXME !
+	 */
+	range.waiting = NULL;
+	range.start = start;
+	range.end = end;
+	range.hmm = hmm;
+
+	spin_lock(&hmm->lock);
+	list_for_each_entry (tmp, &hmm->ranges, list) {
+		if (range.start >= tmp->end || range.end <= tmp->start)
+			continue;
+
+		while (tmp->waiting)
+			tmp = tmp->waiting;
+
+		list_add(&range.list, &hmm->ranges);
+		tmp->waiting = &range;
+		range.wakeup = false;
+		spin_unlock(&hmm->lock);
+
+		wait_event(hmm->wait_queue, range.wakeup);
+		return;
+	}
+	list_add(&range.list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+
+	/*
 	 * Mirror being added or remove is a rare event so list traversal isn't
 	 * protected by a lock, we rely on simple rules. All list modification
 	 * are done using list_add_rcu() and list_del_rcu() under a spinlock to
@@ -127,6 +162,9 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	 */
 	list_for_each_entry (mirror, &hmm->mirrors, list)
 		mirror->ops->update(mirror, action, start, end);
+
+	/* See above FIXME */
+	hmm_vma_range_unlock(&range);
 }
 
 static void hmm_invalidate_page(struct mmu_notifier *mn,
@@ -139,8 +177,6 @@ static void hmm_invalidate_page(struct mmu_notifier *mn,
 
 	VM_BUG_ON(!hmm);
 
-	atomic_inc(&hmm->notifier_count);
-	atomic_inc(&hmm->sequence);
 	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
 	atomic_dec(&hmm->notifier_count);
 	wake_up(&hmm->wait_queue);
@@ -155,8 +191,6 @@ static void hmm_invalidate_range_start(struct mmu_notifier *mn,
 
 	VM_BUG_ON(!hmm);
 
-	atomic_inc(&hmm->notifier_count);
-	atomic_inc(&hmm->sequence);
 	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
 }
 
@@ -244,3 +278,86 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+
+/*
+ * hmm_vma_range_lock() - lock invalidation of a virtual address range
+ * @range: range lock struct provided by caller to track lock while valid
+ * @vma: virtual memory area containing the virtual address range
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * Returns: -EINVAL or -ENOMEM on error, 0 otherwise
+ *
+ * This will block any invalidation of the CPU page table for the range of
+ * virtual addresses provided as argument. Design pattern is:
+ *      hmm_vma_range_lock(vma, start, end, lock);
+ *      hmm_vma_range_get_pfns(vma, start, end, pfns);
+ *      // Device driver goes over each pfn in the pfns array, snapshot of CPU
+ *      // page table and take appropriate actions (use it to populate GPU page
+ *      // table, identify address that need faulting, prepare migration, ...)
+ *      hmm_vma_range_unlock(&lock);
+ *
+ * DO NOT HOLD THE RANGE LOCK FOR LONGER THAN NECESSARY ! THIS DOES BLOCK CPU
+ * PAGE TABLE INVALIDATION !
+ */
+int hmm_vma_range_lock(struct hmm_range *range,
+		       struct vm_area_struct *vma,
+		       unsigned long start,
+		       unsigned long end)
+{
+	struct hmm *hmm;
+
+	VM_BUG_ON(!vma);
+	VM_BUG_ON(!rwsem_is_locked(&vma->vm_mm->mmap_sem));
+
+	range->hmm = hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	range->waiting = NULL;
+	range->start = start;
+	range->end = end;
+
+	spin_lock(&hmm->lock);
+	list_add(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/*
+	 * Wait for all active mmu_notifiers; this is because we cannot keep an
+	 * hmm_range struct around while an mmu_notifier is between a start and
+	 * end section. This needs a change to mmu_notifier. FIXME !
+	 */
+	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_range_lock);
+
+/*
+ * hmm_vma_range_unlock() - unlock invalidation of a virtual address range
+ * @lock: lock struct tracking the range lock
+ *
+ * See hmm_vma_range_lock() for usage.
+ */
+void hmm_vma_range_unlock(struct hmm_range *range)
+{
+	struct hmm *hmm = range->hmm;
+	bool wakeup = false;
+
+	spin_lock(&hmm->lock);
+	list_del(&range->list);
+	if (range->waiting) {
+		range->waiting->wakeup = true;
+		wakeup = true;
+	}
+	spin_unlock(&hmm->lock);
+
+	if (wakeup)
+		wake_up(&hmm->wait_queue);
+}
+EXPORT_SYMBOL(hmm_vma_range_unlock);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 11/18] mm/hmm/mirror: add range monitor helper, to monitor CPU page table update
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 10/18] mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 12/18] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

Complement the hmm_vma_range_lock/unlock() mechanism with a range monitor that
does not block CPU page table invalidation and thus does not guarantee forward
progress. It is still useful as in many situations concurrent CPU page table
updates and CPU snapshots take place in different regions of the virtual
address space.
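
A rough sketch of the resulting retry pattern (not part of the patch);
driver_lock()/driver_unlock() and device_commit() are invented placeholders
for the driver's own page table update serialization:

    #include <linux/hmm.h>
    #include <linux/mm.h>

    static int my_mirror_range(struct vm_area_struct *vma,
                               unsigned long start, unsigned long end)
    {
        struct hmm_range range;

    retry:
        /* Does not block CPU invalidation; waits for pending ones. */
        if (!hmm_vma_range_monitor_start(&range, vma, start, end, true))
            return -EAGAIN;

        /* Build the device page table update from a CPU snapshot here. */

        driver_lock();
        if (!hmm_vma_range_monitor_end(&range)) {
            /* CPU page table changed under us, throw the work away. */
            driver_unlock();
            goto retry;
        }
        device_commit();
        driver_unlock();
        return 0;
    }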

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 18 ++++++++++
 mm/hmm.c            | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index c0b1c07..6571647 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -254,6 +254,24 @@ int hmm_vma_range_lock(struct hmm_range *range,
 void hmm_vma_range_unlock(struct hmm_range *range);
 
 
+/*
+ * Monitoring a range allows tracking any CPU page table modification that can
+ * affect the range. It complements the hmm_vma_range_lock/unlock() mechanism
+ * as a non-blocking method for synchronizing the device page table with the
+ * CPU page table. See function descriptions in mm/hmm.c for documentation.
+ *
+ * NOTE: AFTER A CALL TO hmm_vma_range_monitor_start() THAT RETURNED TRUE YOU
+ * MUST MAKE A CALL TO hmm_vma_range_monitor_end() BEFORE FREEING THE RANGE
+ * STRUCT OR BAD THINGS WILL HAPPEN !
+ */
+bool hmm_vma_range_monitor_start(struct hmm_range *range,
+				 struct vm_area_struct *vma,
+				 unsigned long start,
+				 unsigned long end,
+				 bool wait);
+bool hmm_vma_range_monitor_end(struct hmm_range *range);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index ee05419..746eb96 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -40,6 +40,7 @@ struct hmm {
 	spinlock_t		lock;
 	struct list_head	ranges;
 	struct list_head	mirrors;
+	struct list_head	monitors;
 	atomic_t		sequence;
 	wait_queue_head_t	wait_queue;
 	struct mmu_notifier	mmu_notifier;
@@ -65,6 +66,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 			return NULL;
 		init_waitqueue_head(&hmm->wait_queue);
 		atomic_set(&hmm->notifier_count, 0);
+		INIT_LIST_HEAD(&hmm->monitors);
 		INIT_LIST_HEAD(&hmm->mirrors);
 		atomic_set(&hmm->sequence, 0);
 		hmm->mmu_notifier.ops = NULL;
@@ -112,7 +114,7 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long start,
 				 unsigned long end)
 {
-	struct hmm_range range, *tmp;
+	struct hmm_range range, *tmp, *next;
 	struct hmm_mirror *mirror;
 
 	/*
@@ -127,6 +129,13 @@ static void hmm_invalidate_range(struct hmm *hmm,
 	range.hmm = hmm;
 
 	spin_lock(&hmm->lock);
+	/* Remove any range monitors */
+	list_for_each_entry_safe (tmp, next, &hmm->monitors, list) {
+		if (range.start >= tmp->end || range.end <= tmp->start)
+			continue;
+		/* This range is no longer valid */
+		list_del_init(&tmp->list);
+	}
 	list_for_each_entry (tmp, &hmm->ranges, list) {
 		if (range.start >= tmp->end || range.end <= tmp->start)
 			continue;
@@ -361,3 +370,87 @@ void hmm_vma_range_unlock(struct hmm_range *range)
 		wake_up(&hmm->wait_queue);
 }
 EXPORT_SYMBOL(hmm_vma_range_unlock);
+
+
+/*
+ * hmm_vma_range_monitor_start() - start monitoring of a range
+ * @range: pointer to hmm_range struct use to monitor
+ * @vma: virtual memory area for the range
+ * @start: start address of the range to monitor (inclusive)
+ * @end: end address of the range to monitor (exclusive)
+ * @wait: wait for any pending CPU page table update to finish
+ * Returns: false if there is a pending CPU page table update, true otherwise
+ *
+ * The use pattern of this function is :
+ *   retry:
+ *       hmm_vma_range_monitor_start(range, vma, start, end, true);
+ *       // Do something that rely on stable CPU page table content but do not
+ *       // Prepare device page table update transaction
+ *       ...
+ *       // Take device driver lock that serialize device page table update
+ *       driver_lock_device_page_table_update();
+ *       if (!hmm_vma_range_monitor_end(range)) {
+ *           driver_unlock_device_page_table_update();
+ *           // Abort transaction you just build and cleanup anything that need
+ *           // to be. Same comment as above, about avoiding busy loop.
+ *           goto retry;
+ *       }
+ *       // Commit device page table update
+ *       driver_unlock_device_page_table_update();
+ */
+bool hmm_vma_range_monitor_start(struct hmm_range *range,
+				 struct vm_area_struct *vma,
+				 unsigned long start,
+				 unsigned long end,
+				 bool wait)
+{
+	BUG_ON(!vma);
+	BUG_ON(!range);
+
+	INIT_LIST_HEAD(&range->list);
+	range->hmm = hmm_register(vma->vm_mm);
+	if (!range->hmm)
+		return false;
+
+again:
+	spin_lock(&range->hmm->lock);
+	if (atomic_read(&range->hmm->notifier_count)) {
+		spin_unlock(&range->hmm->lock);
+		if (!wait)
+			return false;
+		/*
+		 * FIXME: Wait for all active mmu_notifiers; this is because we
+		 * cannot keep an hmm_range struct around while waiting for
+		 * range invalidation to finish. Need to update mmu_notifier
+		 * to make this doable.
+		 */
+		wait_event(range->hmm->wait_queue,
+			   !atomic_read(&range->hmm->notifier_count));
+		goto again;
+	}
+	list_add_tail(&range->list, &range->hmm->monitors);
+	spin_unlock(&range->hmm->lock);
+	return true;
+}
+EXPORT_SYMBOL(hmm_vma_range_monitor_start);
+
+/*
+ * hmm_vma_range_monitor_end() - end monitoring of a range
+ * @range: range that was being monitored
+ * Returns: true if no invalidation since hmm_vma_range_monitor_start()
+ */
+bool hmm_vma_range_monitor_end(struct hmm_range *range)
+{
+	bool valid;
+
+	if (!range->hmm || list_empty(&range->list))
+		return false;
+
+	spin_lock(&range->hmm->lock);
+	valid = !list_empty(&range->list);
+	list_del_init(&range->list);
+	spin_unlock(&range->hmm->lock);
+
+	return valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_monitor_end);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 12/18] mm/hmm/mirror: helper to snapshot CPU page table
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 11/18] mm/hmm/mirror: add range monitor helper, to monitor CPU page table update Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 13/18] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to share
the same code with our page fault handler.
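
A minimal sketch of calling the new helper (illustration only, assuming the
caller holds mmap_sem for read and tracks snapshot validity with a range
lock or range monitor):

    #include <linux/hmm.h>
    #include <linux/mm.h>
    #include <linux/slab.h>

    static hmm_pfn_t *my_snapshot(struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end)
    {
        unsigned long npages = (end - start) >> PAGE_SHIFT;
        hmm_pfn_t *pfns;

        pfns = kcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
            return NULL;

        /* Fill pfns[] from the current CPU page table content. */
        if (hmm_vma_get_pfns(vma, start, end, pfns)) {
            kfree(pfns);
            return NULL;
        }
        return pfns;
    }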

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  30 +++++++++-
 mm/hmm.c            | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 6571647..9e0f00d 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -95,13 +95,28 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ: read permission set
  * HMM_PFN_WRITE: CPU page table have the write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisonous memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special, ie result of
+ *      vm_insert_pfn() or vm_insert_page() and thus should not be mirrored by
+ *      a device (the entry will never have HMM_PFN_VALID set and the pfn value
+ *      is undefined)
+ * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_DEVICE (1 << 5)
+#define HMM_PFN_SPECIAL (1 << 6)
+#define HMM_PFN_UNADDRESSABLE (1 << 7)
+#define HMM_PFN_SHIFT 8
 
 static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
 {
@@ -272,6 +287,17 @@ bool hmm_vma_range_monitor_start(struct hmm_range *range,
 bool hmm_vma_range_monitor_end(struct hmm_range *range);
 
 
+/*
+ * Snapshot the CPU page table; the snapshot content validity can be tracked
+ * using the hmm_vma_range_monitor_start/end() or hmm_vma_range_lock()/unlock()
+ * mechanisms. See function descriptions in mm/hmm.c for documentation.
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 746eb96..f2ea76b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,10 +19,15 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
+
 /*
  * struct hmm - HMM per mm struct
  *
@@ -454,3 +459,161 @@ bool hmm_vma_range_monitor_end(struct hmm_range *range)
 	return valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_monitor_end);
+
+
+static void hmm_pfns_empty(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_EMPTY;
+}
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static void hmm_vma_walk(struct vm_area_struct *vma,
+			 unsigned long start,
+			 unsigned long end,
+			 hmm_pfn_t *pfns)
+{
+	unsigned long addr, next;
+	hmm_pfn_t flag;
+
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	for (addr = start; addr < end; addr = next) {
+		unsigned long i = (addr - start) >> PAGE_SHIFT;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+		pmd_t pmd;
+
+		/*
+		 * We are accessing/faulting for a device from an unknown
+		 * thread that might be foreign to the mm we are faulting
+		 * against so do not call arch_vma_access_permitted() !
+		 */
+
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset(vma->vm_mm, addr);
+		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		if (pud_none(*pudp) || pud_bad(*pudp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd) || pmd_bad(pmd)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+			hmm_pfn_t flags = flag;
+
+			if (pmd_protnone(pmd)) {
+				hmm_pfns_clear(&pfns[i], addr, next);
+				continue;
+			}
+			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
+			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+			continue;
+		}
+
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			pte_t pte = *ptep;
+
+			pfns[i] = 0;
+
+			if (pte_none(pte)) {
+				pfns[i] = HMM_PFN_EMPTY;
+				continue;
+			}
+
+			entry = pte_to_swp_entry(pte);
+			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				continue;
+			}
+
+			if (pte_present(pte)) {
+				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
+				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			*/
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE;
+				pfns[i] |= HMM_PFN_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual address
+ * @vma: virtual memory area containing the virtual address range
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @entries: array of hmm_pfn_t provided by caller fill by function
+ * Returns: -EINVAL if invalid argument, 0 otherwise
+ *
+ * This snapshots the CPU page table for a range of virtual addresses. The
+ * snapshot is only valid while protected by hmm_vma_range_lock() or while a
+ * range monitor (hmm_vma_range_monitor_start/end()) reports it as still valid.
+ *
+ * It will fill the pfns array using CPU ptes. Note that any invalid CPU page
+ * table entry, at the time of the snapshot, can turn into a valid one after
+ * this function returns but before calling hmm_vma_range_unlock().
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm_vma_walk(vma, start, end, pfns);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 13/18] mm/hmm/mirror: device page fault handler
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 12/18] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 14/18] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike handle_mm_fault()
it does not trigger migration of device memory back to system memory.
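
For illustration only (not part of the patch), faulting a single page for
write could look like the sketch below, assuming the caller holds mmap_sem
for read with a range monitor active around the call:

    #include <linux/hmm.h>
    #include <linux/mm.h>

    static int my_fault_one_page(struct vm_area_struct *vma,
                                 unsigned long addr, hmm_pfn_t *pfn)
    {
        /* Ask for a write fault on this single page. */
        *pfn = HMM_PFN_FAULT | HMM_PFN_WRITE;

        if (!hmm_vma_fault(vma, addr, addr + PAGE_SIZE, pfn))
            return -EAGAIN; /* mmap_sem was dropped, caller must retry */

        if (*pfn & HMM_PFN_ERROR)
            return -EFAULT;

        return 0;
    }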

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  33 ++++++-
 mm/hmm.c            | 262 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 267 insertions(+), 28 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 9e0f00d..c79abfc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -99,6 +99,7 @@ struct hmm;
  * HMM_PFN_WRITE: CPU page table have the write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry point to poisonous memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_FAULT: use by hmm_vma_fault() to signify which address need faulting
  * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
  *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
@@ -113,10 +114,11 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_WRITE (1 << 2)
 #define HMM_PFN_ERROR (1 << 3)
 #define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_DEVICE (1 << 5)
-#define HMM_PFN_SPECIAL (1 << 6)
-#define HMM_PFN_UNADDRESSABLE (1 << 7)
-#define HMM_PFN_SHIFT 8
+#define HMM_PFN_FAULT (1 << 5)
+#define HMM_PFN_DEVICE (1 << 6)
+#define HMM_PFN_SPECIAL (1 << 7)
+#define HMM_PFN_UNADDRESSABLE (1 << 8)
+#define HMM_PFN_SHIFT 9
 
 static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
 {
@@ -298,6 +300,29 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     hmm_pfn_t *pfns);
 
 
+/*
+ * Fault memory on behalf of a device driver; unlike handle_mm_fault() it will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array
+ * will be updated with the fault result and the current snapshot of the CPU
+ * page table for the range. Note that you must use
+ * hmm_vma_range_monitor_start/end() to ascertain whether you can use those.
+ *
+ * DO NOT USE hmm_vma_range_lock()/hmm_vma_range_unlock() IT WILL DEADLOCK !
+ *
+ * The mmap_sem must be taken in read mode before entering, and it might be
+ * dropped by the function; if that happens the function returns false.
+ * Otherwise, if the mmap_sem is still held, it returns true. The return value
+ * does not reflect whether the fault was successful; inspect the hmm_pfn_t array to
+ * determine fault status.
+ *
+ * See function description in mm/hmm.c for documentation.
+ */
+bool hmm_vma_fault(struct vm_area_struct *vma,
+		   unsigned long start,
+		   unsigned long end,
+		   hmm_pfn_t *pfns);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index f2ea76b..521adfd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -461,6 +461,14 @@ bool hmm_vma_range_monitor_end(struct hmm_range *range)
 EXPORT_SYMBOL(hmm_vma_range_monitor_end);
 
 
+static void hmm_pfns_error(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_ERROR;
+}
+
 static void hmm_pfns_empty(hmm_pfn_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
@@ -477,10 +485,47 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
-static void hmm_vma_walk(struct vm_area_struct *vma,
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	unsigned long npfns = (end - addr) >> PAGE_SHIFT;
+
+	memset(pfns, 0, sizeof(*pfns) * npfns);
+}
+
+static bool hmm_pfns_fault(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		if (*pfns & HMM_PFN_FAULT)
+			return true;
+	return false;
+}
+
+static bool hmm_vma_do_fault(struct vm_area_struct *vma,
+			     unsigned long addr,
+			     hmm_pfn_t *pfn)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	int r;
+
+	flags |= (*pfn & HMM_PFN_WRITE) ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return false;
+	if (r & VM_FAULT_ERROR)
+		*pfn = HMM_PFN_ERROR;
+
+	return true;
+}
+
+static bool hmm_vma_walk(struct vm_area_struct *vma,
 			 unsigned long start,
 			 unsigned long end,
-			 hmm_pfn_t *pfns)
+			 hmm_pfn_t *pfns,
+			 bool fault)
 {
 	unsigned long addr, next;
 	hmm_pfn_t flag;
@@ -489,6 +534,7 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 
 	for (addr = start; addr < end; addr = next) {
 		unsigned long i = (addr - start) >> PAGE_SHIFT;
+		bool writefault = false;
 		pgd_t *pgdp;
 		pud_t *pudp;
 		pmd_t *pmdp;
@@ -504,15 +550,37 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset(vma->vm_mm, addr);
 		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault || !hmm_pfns_fault(&pfns[i], addr, next)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pudp = pud_alloc(vma->vm_mm, pgdp, addr);
+			if (!pudp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pud_addr_end(addr, end);
 		pudp = pud_offset(pgdp, addr);
 		if (pud_none(*pudp) || pud_bad(*pudp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault || !hmm_pfns_fault(&pfns[i], addr, next)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pmdp = pmd_alloc(vma->vm_mm, pudp, addr);
+			if (!pmdp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pmd_addr_end(addr, end);
@@ -520,8 +588,23 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd = pmd_read_atomic(pmdp);
 		barrier();
 		if (pmd_none(pmd) || pmd_bad(pmd)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault || !hmm_pfns_fault(&pfns[i], addr, next)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			/*
+			 * Use pte_alloc() instead of pte_alloc_map, because we
+			 * can't run pte_offset_map on the pmd, if an huge pmd
+			 * could materialize from under us.
+			 */
+			if (unlikely(pte_alloc(vma->vm_mm, pmdp, addr))) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
@@ -529,12 +612,33 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 
 			if (pmd_protnone(pmd)) {
 				hmm_pfns_clear(&pfns[i], addr, next);
+				if (!fault || !(vma->vm_flags & VM_READ))
+					continue;
+				if (!hmm_pfns_fault(&pfns[i], addr, next))
+					continue;
+
+				if (!hmm_vma_do_fault(vma, addr, &pfns[i]))
+					return false;
+				/* Start again for current address */
+				next = addr;
 				continue;
 			}
 			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
 			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
-			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+			for (; addr < next; addr += PAGE_SIZE, i++, pfn++) {
+				bool fault = pfns[i] & HMM_PFN_FAULT;
+				bool write = pfns[i] & HMM_PFN_WRITE;
+
 				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+				if (!fault || !write || flags & HMM_PFN_WRITE)
+					continue;
+				pfns[i] = HMM_PFN_FAULT | HMM_PFN_WRITE;
+				if (!hmm_vma_do_fault(vma, addr, &pfns[i]))
+					return false;
+				/* Start again for current address */
+				next = addr;
+				break;
+			}
 			continue;
 		}
 
@@ -543,41 +647,91 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t pte = *ptep;
 
-			pfns[i] = 0;
-
 			if (pte_none(pte)) {
-				pfns[i] = HMM_PFN_EMPTY;
-				continue;
+				if (!fault || !(pfns[i] & HMM_PFN_FAULT)) {
+					pfns[i] = HMM_PFN_EMPTY;
+					continue;
+				}
+				if (!(vma->vm_flags & VM_READ)) {
+					pfns[i] = HMM_PFN_EMPTY;
+					continue;
+				}
+				if (!hmm_vma_do_fault(vma, addr, &pfns[i])) {
+					hmm_pfns_clear(&pfns[i], addr, end);
+					pte_unmap(ptep);
+					return false;
+				}
+				pte = *ptep;
 			}
 
 			entry = pte_to_swp_entry(pte);
 			if (!pte_present(pte) && !non_swap_entry(entry)) {
-				continue;
+				if (!fault || !(pfns[i] & HMM_PFN_FAULT)) {
+					pfns[i] = 0;
+					continue;
+				}
+				if (!(vma->vm_flags & VM_READ)) {
+					pfns[i] = 0;
+					continue;
+				}
+				if (!hmm_vma_do_fault(vma, addr, &pfns[i])) {
+					hmm_pfns_clear(&pfns[i], addr, end);
+					pte_unmap(ptep);
+					return false;
+				}
+				pte = *ptep;
 			}
 
+			writefault = (pfns[i]&(HMM_PFN_WRITE|HMM_PFN_FAULT)) ==
+				     (HMM_PFN_WRITE|HMM_PFN_FAULT) && fault;
+
 			if (pte_present(pte)) {
 				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
 				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-				continue;
-			}
-
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			*/
-			if (is_device_entry(entry)) {
+			} else if (is_device_entry(entry)) {
+				/* Do not fault device entry */
 				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_entry(entry))
 					pfns[i] |= HMM_PFN_WRITE;
 				pfns[i] |= HMM_PFN_DEVICE;
 				pfns[i] |= HMM_PFN_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry) && fault) {
+				migration_entry_wait(vma->vm_mm, pmdp, addr);
+				/* Start again for current address */
+				next = addr;
+				ptep++;
+				break;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
+			if (!(vma->vm_flags & VM_READ) ||
+			    !(vma->vm_flags & VM_WRITE)) {
+				writefault = false;
+				continue;
+			}
+
+			if (writefault && !(pfns[i] & HMM_PFN_WRITE)) {
+				ptep++;
+				break;
+			}
+			writefault = false;
 		}
 		pte_unmap(ptep - 1);
+
+		if (writefault && (vma->vm_flags & VM_WRITE)) {
+			pfns[i] = HMM_PFN_WRITE | HMM_PFN_FAULT;
+			if (!hmm_vma_do_fault(vma, addr, &pfns[i])) {
+				return false;
+			}
+			writefault = false;
+			/* Start again for current address */
+			next = addr;
+		}
 	}
+
+	return true;
 }
 
 /*
@@ -613,7 +767,67 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	if (end < vma->vm_start || end > vma->vm_end)
 		return -EINVAL;
 
-	hmm_vma_walk(vma, start, end, pfns);
+	hmm_vma_walk(vma, start, end, pfns, false);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entries with the fault flag set are faulted
+ * Returns: true if mmap_sem is still held, false if mmap_sem has been released
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * Only pfns with the fault flag set will be faulted and the hmm_pfn_t write
+ * flag will be used to determine if it is a write fault or not.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ *   retry:
+ *      down_read(&mm->mmap_sem);
+ *      // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *      // array accordingly
+ *      hmm_vma_range_monitor_start(range, vma, start, end);
+ *      if (!hmm_vma_fault(vma, start, end, pfns, allow_retry)) {
+ *          hmm_vma_range_monitor_end(range);
+ *          // You might want to rate limit or yield to play nicely, you may
+ *          // also commit any valid pfn in the array assuming that you are
+ *          // getting true from hmm_vma_range_monitor_end()
+ *          goto retry;
+ *      }
+ *      // Take device driver lock that serialize device page table update
+ *      driver_lock_device_page_table_update();
+ *      if (hmm_vma_range_monitor_end(range)) {
+ *          // Commit pfns we got from hmm_vma_fault()
+ *      }
+ *      driver_unlock_device_page_table_update();
+ *      up_read(&mm->mmap_sem)
+ */
+bool hmm_vma_fault(struct vm_area_struct *vma,
+		   unsigned long start,
+		   unsigned long end,
+		   hmm_pfn_t *pfns)
+{
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return true;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return true;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return true;
+
+	return hmm_vma_walk(vma, start, end, pfns, true);
+}
+EXPORT_SYMBOL(hmm_vma_fault);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 14/18] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 13/18] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 15/18] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow unmapping and restoring the special swap entries used for
un-addressable ZONE_DEVICE memory.
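
From the point of view of a page table walker, such an un-addressable page
shows up as a non-present pte carrying a device swap entry. A minimal sketch
of how such an entry is recognized (not part of the patch, using the helpers
introduced earlier in this series):

  if (!pte_present(pte)) {
          swp_entry_t entry = pte_to_swp_entry(pte);

          if (is_device_entry(entry)) {
                  /* an un-addressable ZONE_DEVICE page backs this address */
                  struct page *page = device_entry_to_page(entry);
                  bool write = is_write_device_entry(entry);

                  /* handle page/write as try_to_unmap_one() and
                   * remove_migration_pte() now do below */
          }
  }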

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/migrate.c | 11 ++++++++++-
 mm/rmap.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 66ce6b4..6b6b457 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -40,6 +40,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -248,7 +249,15 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		pte = arch_make_huge_pte(pte, vma, new, 0);
 	}
 #endif
-	flush_dcache_page(new);
+
+	if (unlikely(is_zone_device_page(new)) && !is_addressable_page(new)) {
+		entry = make_device_entry(new, pte_write(pte));
+		pte = swp_entry_to_pte(entry);
+		if (pte_swp_soft_dirty(*ptep))
+			pte = pte_mksoft_dirty(pte);
+	} else
+		flush_dcache_page(new);
+
 	set_pte_at(mm, addr, ptep, pte);
 
 	if (PageHuge(new)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 1ef3640..fff3578 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -61,6 +61,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1455,6 +1456,52 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			goto out;
 	}
 
+	if ((flags & TTU_MIGRATION) && is_zone_device_page(page)) {
+		swp_entry_t entry;
+		pte_t swp_pte;
+		pmd_t *pmdp;
+
+		if (!(page->pgmap->flags & MEMORY_MOVABLE))
+			goto out;
+
+		pmdp = mm_find_pmd(mm, address);
+		if (!pmdp)
+			goto out;
+
+		pte = pte_offset_map_lock(mm, pmdp, address, &ptl);
+		if (!pte)
+			goto out;
+
+		pteval = ptep_get_and_clear(mm, address, pte);
+		if (pte_present(pteval) || pte_none(pteval)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		entry = pte_to_swp_entry(pteval);
+		if (!is_device_entry(entry)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		if (device_entry_to_page(entry) != page) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		entry = make_migration_entry(page, 0);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(*pte))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(mm, address, pte, swp_pte);
+		goto discard;
+	}
+
 	pte = page_check_address(page, mm, address, &ptl,
 				 PageTransCompound(page));
 	if (!pte)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 15/18] mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 14/18] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow migration without a copy when the destination page already holds
the source page content. This is useful for HMM migration to device
memory, where the page is copied before the final migration step.

This feature needs a careful audit of filesystem code to make sure
that no one can write to the source page while it is unmapped and
locked. It should be safe for most filesystems, but as a precaution
return an error until support for device migration is added to them.
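
For reference, the conversion applied to each filesystem follows roughly the
pattern below (foofs_migratepage is a made-up name standing in for the
converted callbacks; is_addressable_page() and the copy argument come from
this series):

  #ifdef CONFIG_MIGRATION
  static int foofs_migratepage(struct address_space *mapping,
                               struct page *newpage, struct page *page,
                               enum migrate_mode mode, bool copy)
  {
          /* Can only migrate addressable memory for now */
          if (!is_addressable_page(newpage))
                  return -EINVAL;

          /* forward the copy flag so the core can skip the CPU copy */
          return migrate_page(mapping, newpage, page, mode, copy);
  }
  #endif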

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/staging/lustre/lustre/llite/rw26.c |  8 +++--
 fs/aio.c                                   |  7 +++-
 fs/btrfs/disk-io.c                         | 11 ++++--
 fs/hugetlbfs/inode.c                       |  9 +++--
 fs/nfs/internal.h                          |  5 +--
 fs/nfs/write.c                             |  9 +++--
 fs/ubifs/file.c                            |  8 ++++-
 include/linux/balloon_compaction.h         |  3 +-
 include/linux/fs.h                         | 13 ++++---
 include/linux/migrate.h                    |  7 ++--
 mm/balloon_compaction.c                    |  2 +-
 mm/migrate.c                               | 56 +++++++++++++++++++-----------
 12 files changed, 95 insertions(+), 43 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index d98c7ac..e163d43 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -43,6 +43,7 @@
 #include <linux/uaccess.h>
 
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
@@ -643,9 +644,12 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
 #ifdef CONFIG_MIGRATION
 static int ll_migratepage(struct address_space *mapping,
 			  struct page *newpage, struct page *page,
-			  enum migrate_mode mode
-		)
+			  enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/* Always fail page migration until we have a proper implementation */
 	return -EIO;
 }
diff --git a/fs/aio.c b/fs/aio.c
index 4fe81d1..416c7ef 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
 #include <linux/blkdev.h>
 #include <linux/compat.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
@@ -363,13 +364,17 @@ static const struct file_operations aio_ring_fops = {
 
 #if IS_ENABLED(CONFIG_MIGRATION)
 static int aio_migratepage(struct address_space *mapping, struct page *new,
-			struct page *old, enum migrate_mode mode)
+			   struct page *old, enum migrate_mode mode, bool copy)
 {
 	struct kioctx *ctx;
 	unsigned long flags;
 	pgoff_t idx;
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(new))
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..9a29aa5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ratelimit.h>
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
@@ -1023,9 +1024,13 @@ out_w_error:
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page,
-			enum migrate_mode mode)
+			     struct page *newpage, struct page *page,
+			     enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * we can't safely write a btree page from here,
 	 * we haven't done the locking hook
@@ -1039,7 +1044,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7337cac..de77e6f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -35,6 +35,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/uio.h>
 
 #include <asm/uaccess.h>
@@ -842,11 +843,15 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page,
-				enum migrate_mode mode)
+				  struct page *newpage, struct page *page,
+				  enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index da9e558..db1c2ad 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -537,8 +537,9 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct pnfs_ds_commit_info *cinfo)
 
 
 #ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, enum migrate_mode);
+extern int nfs_migrate_page(struct address_space *mapping,
+			    struct page *newpage, struct page *page,
+			    enum migrate_mode, bool copy);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5321183..d7130a5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
 #include <linux/writeback.h>
 #include <linux/swap.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
@@ -2023,8 +2024,12 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		     struct page *page, enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
 	 * an in-progress read or write request. Don't try to migrate it.
@@ -2039,7 +2044,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 	if (!nfs_fscache_release_page(page, GFP_KERNEL))
 		return -EBUSY;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 7bbf420..57bff28 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -53,6 +53,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 static int read_block(struct inode *inode, void *addr, unsigned int block,
 		      struct ubifs_data_node *dn)
@@ -1455,10 +1456,15 @@ static int ubifs_set_page_dirty(struct page *page)
 
 #ifdef CONFIG_MIGRATION
 static int ubifs_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..27cf3e3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -85,7 +85,8 @@ extern bool balloon_page_isolate(struct page *page,
 extern void balloon_page_putback(struct page *page);
 extern int balloon_page_migrate(struct address_space *mapping,
 				struct page *newpage,
-				struct page *page, enum migrate_mode mode);
+				struct page *page, enum migrate_mode mode,
+				bool copy);
 
 /*
  * balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02bc78e..a54d164 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -396,8 +396,9 @@ struct address_space_operations {
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
 	 */
-	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+	int (*migratepage)(struct address_space *mapping,
+			   struct page *newpage, struct page *page,
+			   enum migrate_mode, bool copy);
 	bool (*isolate_page)(struct page *, isolate_mode_t);
 	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
@@ -2989,9 +2990,11 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
 extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *,
-				enum migrate_mode);
+extern int buffer_migrate_page(struct address_space *mapping,
+			       struct page *newpage,
+			       struct page *page,
+			       enum migrate_mode,
+			       bool copy);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..37b77ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,8 +33,11 @@ extern char *migrate_reason_names[MR_TYPES];
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+extern int migrate_page(struct address_space *mapping,
+			struct page *newpage,
+			struct page *page,
+			enum migrate_mode,
+			bool copy);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
 extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..ed5cacb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -135,7 +135,7 @@ void balloon_page_putback(struct page *page)
 /* move_to_new_page() counterpart for a ballooned page */
 int balloon_page_migrate(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		enum migrate_mode mode, bool copy)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b6b457..d9ce8db 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,18 +622,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	}
 }
 
-/*
- * Copy the page to its new location
- */
-void migrate_page_copy(struct page *newpage, struct page *page)
+static void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -687,6 +679,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -700,8 +705,8 @@ EXPORT_SYMBOL(migrate_page_copy);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		 struct page *newpage, struct page *page,
+		 enum migrate_mode mode, bool copy)
 {
 	int rc;
 
@@ -712,7 +717,11 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
+
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -724,13 +733,14 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			struct page *newpage, struct page *page,
+			enum migrate_mode mode, bool copy)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, mode);
+		return migrate_page(mapping, newpage, page, mode, copy);
 
 	head = page_buffers(page);
 
@@ -762,12 +772,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -822,7 +835,8 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, enum migrate_mode mode)
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
@@ -839,7 +853,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, true);
 }
 
 /*
@@ -867,7 +881,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	if (likely(is_lru)) {
 		if (!mapping)
-			rc = migrate_page(mapping, newpage, page, mode);
+			rc = migrate_page(mapping, newpage, page, mode, true);
 		else if (mapping->a_ops->migratepage)
 			/*
 			 * Most pages have a mapping and most filesystems
@@ -877,7 +891,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 			 * for page migration.
 			 */
 			rc = mapping->a_ops->migratepage(mapping, newpage,
-							page, mode);
+							page, mode, true);
 		else
 			rc = fallback_migrate_page(mapping, newpage,
 							page, mode);
@@ -894,7 +908,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		}
 
 		rc = mapping->a_ops->migratepage(mapping, newpage,
-						page, mode);
+						page, mode, true);
 		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
 			!PageIsolated(page));
 	}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 15/18] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 19:57   ` Aneesh Kumar K.V
                     ` (2 more replies)
  2016-11-18 18:18 ` [HMM v13 17/18] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
                   ` (3 subsequent siblings)
  19 siblings, 3 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates the
memory backing a range of virtual addresses of a process to different
memory (which can be allocated through a special allocator). It
differs from NUMA migration by working on a range of virtual addresses
and thus by doing the migration in chunks that can be large enough to
use a DMA engine or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memories have different characteristics (latency, bandwidth, ...). As
an example, IBM platforms with a CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high performance memory that is not
managed as a cache but presented as regular memory (while being faster
and with lower latency than DDR) will also be prime users of this
patch.

Migration to private device memory will be useful for devices that
have a large pool of such memory, like GPUs; NVidia plans to use HMM
for that.
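
A device driver is expected to use the new helper roughly as sketched below;
the example_* names and the driver internals are made up, only
hmm_vma_migrate() and struct hmm_migrate_ops come from this patch:

  static void example_alloc_and_copy(struct vm_area_struct *vma,
                                     unsigned long start,
                                     unsigned long end,
                                     hmm_pfn_t *pfns,
                                     void *private)
  {
          /*
           * Allocate destination pages, use the device DMA engine to copy
           * the source pages and set HMM_PFN_MIGRATE | HMM_PFN_LOCKED on
           * every entry the driver accepts to migrate.
           */
  }

  static void example_finalize_and_map(struct vm_area_struct *vma,
                                       unsigned long start,
                                       unsigned long end,
                                       hmm_pfn_t *pfns,
                                       void *private)
  {
          /*
           * Entries that still have HMM_PFN_MIGRATE set were successfully
           * migrated; update the device page table for those.
           */
  }

  static const struct hmm_migrate_ops example_migrate_ops = {
          .alloc_and_copy         = example_alloc_and_copy,
          .finalize_and_map       = example_finalize_and_map,
  };

  /* with mmap_sem held and pfns[] sized for (end - start) >> PAGE_SHIFT */
  ret = hmm_vma_migrate(&example_migrate_ops, vma, start, end, pfns, NULL);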

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  54 ++++-
 mm/migrate.c        | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 635 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index c79abfc..9777309 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -101,10 +101,13 @@ struct hmm;
  * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
  * HMM_PFN_FAULT: use by hmm_vma_fault() to signify which address need faulting
  * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_LOCKED: underlying struct page is locked
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
  *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
  *      device (the entry will never have HMM_PFN_VALID set and the pfn value
  *      is undefine)
+ * HMM_PFN_MIGRATE: used by hmm_vma_migrate() to signify which addresses can
+ *      be migrated
  * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
@@ -116,9 +119,11 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_EMPTY (1 << 4)
 #define HMM_PFN_FAULT (1 << 5)
 #define HMM_PFN_DEVICE (1 << 6)
-#define HMM_PFN_SPECIAL (1 << 7)
-#define HMM_PFN_UNADDRESSABLE (1 << 8)
-#define HMM_PFN_SHIFT 9
+#define HMM_PFN_LOCKED (1 << 7)
+#define HMM_PFN_SPECIAL (1 << 8)
+#define HMM_PFN_MIGRATE (1 << 9)
+#define HMM_PFN_UNADDRESSABLE (1 << 10)
+#define HMM_PFN_SHIFT 11
 
 static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
 {
@@ -323,6 +328,49 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
 		   hmm_pfn_t *pfns);
 
 
+/*
+ * struct hmm_migrate_ops - migrate operation callback
+ *
+ * @alloc_and_copy: allocate destination memory and copy source to it
+ * @finalize_and_map: allow caller to inspect successfully migrated pages
+ *
+ * The new HMM migrate helper hmm_vma_migrate() allows memory migration to use
+ * a device DMA engine to perform the copy from source to destination memory;
+ * it also allows the caller to use its own allocator for destination memory.
+ *
+ * Note that in alloc_and_copy the device driver can decide not to migrate some
+ * of the entries; for those it must clear the HMM_PFN_MIGRATE flag. Each
+ * destination page must be locked and the corresponding hmm_pfn_t value in the
+ * array updated with the HMM_PFN_MIGRATE and HMM_PFN_LOCKED flags set (and of
+ * course be a valid entry). It is expected that the allocated page will have
+ * an elevated refcount and that a put_page() will free it. Device drivers
+ * might want to allocate with an extra refcount if they want to control
+ * deallocation of failed migrations inside the finalize_and_map() callback.
+ *
+ * Inside finalize_and_map() the device driver must use the HMM_PFN_MIGRATE
+ * flag to determine which pages have been successfully migrated.
+ */
+struct hmm_migrate_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       unsigned long start,
+			       unsigned long end,
+			       hmm_pfn_t *pfns,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 unsigned long start,
+				 unsigned long end,
+				 hmm_pfn_t *pfns,
+				 void *private);
+};
+
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    unsigned long start,
+		    unsigned long end,
+		    hmm_pfn_t *pfns,
+		    void *private);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index d9ce8db..393d592 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
 #include <linux/memremap.h>
+#include <linux/hmm.h>
 
 #include <asm/tlbflush.h>
 
@@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_MOVABLE flag set (see include/linux/memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2087,3 +2096,578 @@ out_unlock:
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+#if defined(CONFIG_HMM)
+struct hmm_migrate {
+	struct vm_area_struct	*vma;
+	unsigned long		start;
+	unsigned long		end;
+	unsigned long		npages;
+	hmm_pfn_t		*pfns;
+};
+
+static int hmm_collect_walk_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_migrate *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	spinlock_t *ptl;
+	hmm_pfn_t *pfns;
+	int pages = 0;
+	pte_t *ptep;
+
+again:
+	if (pmd_none(*pmdp))
+		return 0;
+
+	split_huge_pmd(walk->vma, pmdp, addr);
+	if (pmd_trans_unstable(pmdp))
+		goto again;
+
+	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+
+	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
+		unsigned long pfn;
+		swp_entry_t entry;
+		struct page *page;
+		hmm_pfn_t flags;
+		bool write;
+		pte_t pte;
+
+		pte = ptep_get_and_clear(mm, addr, ptep);
+		if (!pte_present(pte)) {
+			if (pte_none(pte))
+				continue;
+
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_entry(entry)) {
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+
+			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
+			page = device_entry_to_page(entry);
+			write = is_write_device_entry(entry);
+			pfn = page_to_pfn(page);
+
+			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+
+		} else {
+			pfn = pte_pfn(pte);
+			page = pfn_to_page(pfn);
+			write = pte_write(pte);
+			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
+		}
+
+		/* FIXME support THP see hmm_migrate_page_check() */
+		if (PageTransCompound(page))
+			continue;
+
+		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
+		*pfns |= write ? HMM_PFN_WRITE : 0;
+		migrate->npages++;
+		get_page(page);
+
+		if (!trylock_page(page)) {
+			set_pte_at(mm, addr, ptep, pte);
+		} else {
+			pte_t swp_pte;
+
+			*pfns |= HMM_PFN_LOCKED;
+
+			entry = make_migration_entry(page, write);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			page_remove_rmap(page, false);
+			put_page(page);
+			pages++;
+		}
+	}
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	/* Only flush the TLB if we actually modified any entries */
+	if (pages)
+		flush_tlb_range(walk->vma, start, end);
+
+	return 0;
+}
+
+static void hmm_migrate_collect(struct hmm_migrate *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = hmm_collect_walk_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
+}
+
+static inline bool hmm_migrate_page_check(struct page *page, int extra)
+{
+	/*
+	 * FIXME support THP (transparent huge page), it is a bit more complex
+	 * to check them than regular pages because they can be mapped with a
+	 * pmd or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	if (is_zone_device_page(page))
+		extra++;
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+	struct mm_struct *mm = migrate->vma->vm_mm;
+	struct vm_area_struct *vma = migrate->vma;
+	unsigned long restore = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+again:
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
+
+		if (!page)
+			continue;
+
+		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
+			lock_page(page);
+			migrate->pfns[i] |= HMM_PFN_LOCKED;
+		}
+
+		/* ZONE_DEVICE page are not on LRU */
+		if (is_zone_device_page(page))
+			goto check;
+
+		if (!PageLRU(page) && allow_drain) {
+			/* Drain CPU's pagevec so page can be isolated */
+			lru_add_drain_all();
+			allow_drain = false;
+			goto again;
+		}
+
+		if (isolate_lru_page(page)) {
+			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
+			migrate->npages--;
+			put_page(page);
+			restore++;
+		} else
+			/* Drop the reference we took in collect */
+			put_page(page);
+
+check:
+		if (!hmm_migrate_page_check(page, 1)) {
+			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
+			migrate->npages--;
+			restore++;
+		}
+	}
+
+	if (!restore)
+		return;
+
+	for (addr = migrate->start, i = 0; addr < migrate->end;) {
+		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
+		unsigned long next, restart;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE)) {
+			addr += PAGE_SIZE;
+			i++;
+			continue;
+		}
+
+		restart = addr;
+		pgdp = pgd_offset(mm, addr);
+		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
+			addr = pgd_addr_end(addr, migrate->end);
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+		pudp = pud_offset(pgdp, addr);
+		if (!pudp || pud_none(*pudp)) {
+			addr = pgd_addr_end(addr, migrate->end);
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(addr, migrate->end);
+		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
+			addr = next;
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			bool write;
+			pte_t pte;
+
+			page = hmm_pfn_to_page(migrate->pfns[i]);
+			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			write = migrate->pfns[i] & HMM_PFN_WRITE;
+			write &= (vma->vm_flags & VM_WRITE);
+
+			/* Here it means pte must be a valid migration entry */
+			pte = ptep_get_and_clear(mm, addr, ptep);
+			if (pte_none(pte) || pte_present(pte))
+				/* SOMETHING BAD IS GOING ON ! */
+				continue;
+			entry = pte_to_swp_entry(pte);
+			if (!is_migration_entry(entry))
+				/* SOMETHING BAD IS GOING ON ! */
+				continue;
+
+			if (is_zone_device_page(page) &&
+			    !is_addressable_page(page)) {
+				entry = make_device_entry(page, write);
+				pte = swp_entry_to_pte(entry);
+			} else {
+				pte = mk_pte(page, vma->vm_page_prot);
+				pte = pte_mkold(pte);
+				if (write)
+					pte = pte_mkwrite(pte);
+			}
+			if (pte_swp_soft_dirty(*ptep))
+				pte = pte_mksoft_dirty(pte);
+
+			get_page(page);
+			set_pte_at(mm, addr, ptep, pte);
+			if (PageAnon(page))
+				page_add_anon_rmap(page, vma, addr, false);
+			else
+				page_add_file_rmap(page, false);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+		addr = restart;
+		i = (addr - migrate->start) >> PAGE_SHIFT;
+		for (; addr < next && restore; addr += PAGE_SIZE, i++) {
+			page = hmm_pfn_to_page(migrate->pfns[i]);
+			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			migrate->pfns[i] = 0;
+			unlock_page(page);
+			restore--;
+
+			if (is_zone_device_page(page)) {
+				put_page(page);
+				continue;
+			}
+
+			putback_lru_page(page);
+		}
+
+		if (!restore)
+			break;
+	}
+}
+
+static void hmm_migrate_unmap(struct hmm_migrate *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	unsigned long addr = migrate->start, i = 0, restore = 0;
+
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
+
+		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !hmm_migrate_page_check(page, 1)) {
+			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
+			migrate->npages--;
+			restore++;
+		}
+	}
+
+	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
+
+		if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->pfns[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		if (is_zone_device_page(page)) {
+			put_page(page);
+			continue;
+		}
+
+		putback_lru_page(page);
+	}
+}
+
+static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+	struct mm_struct *mm = migrate->vma->vm_mm;
+
+	for (; addr < migrate->end;) {
+		unsigned long next;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		pgdp = pgd_offset(mm, addr);
+		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
+			addr = pgd_addr_end(addr, migrate->end);
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+		pudp = pud_offset(pgdp, addr);
+		if (!pudp || pud_none(*pudp)) {
+			addr = pgd_addr_end(addr, migrate->end);
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(addr, migrate->end);
+		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
+			addr = next;
+			i = (addr - migrate->start) >> PAGE_SHIFT;
+			continue;
+		}
+
+		/* No need to lock, nothing can change from under us */
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			struct address_space *mapping;
+			struct page *newpage, *page;
+			swp_entry_t entry;
+			int r;
+
+			newpage = hmm_pfn_to_page(migrate->pfns[i]);
+			if (!newpage || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+			if (pte_none(*ptep) || pte_present(*ptep)) {
+				/* This should not happen but be nice */
+				migrate->pfns[i] = 0;
+				put_page(newpage);
+				continue;
+			}
+			entry = pte_to_swp_entry(*ptep);
+			if (!is_migration_entry(entry)) {
+				/* This should not happen but be nice */
+				migrate->pfns[i] = 0;
+				put_page(newpage);
+				continue;
+			}
+
+			page = migration_entry_to_page(entry);
+			mapping = page_mapping(page);
+
+			/*
+			 * For now only support private anonymous when migrating
+			 * to un-addressable device memory.
+			 */
+			if (mapping && is_zone_device_page(newpage) &&
+			    !is_addressable_page(newpage)) {
+				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
+				continue;
+			}
+
+			r = migrate_page(mapping, newpage, page,
+					 MIGRATE_SYNC, false);
+			if (r != MIGRATEPAGE_SUCCESS)
+				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+	struct mm_struct *mm = migrate->vma->vm_mm;
+
+	for (; addr < migrate->end;) {
+		unsigned long next;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		pgdp = pgd_offset(mm, addr);
+		pudp = pud_offset(pgdp, addr);
+		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(addr, migrate->end);
+
+		/* No need to lock, nothing can change from under us */
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			struct page *page, *newpage;
+			swp_entry_t entry;
+
+			if (pte_none(*ptep) || pte_present(*ptep))
+				continue;
+			entry = pte_to_swp_entry(*ptep);
+			if (!is_migration_entry(entry))
+				continue;
+
+			page = migration_entry_to_page(entry);
+			newpage = hmm_pfn_to_page(migrate->pfns[i]);
+			if (!newpage)
+				newpage = page;
+			remove_migration_ptes(page, newpage, false);
+
+			migrate->pfns[i] = 0;
+			unlock_page(page);
+			migrate->npages--;
+
+			if (is_zone_device_page(page))
+				put_page(page);
+			else
+				putback_lru_page(page);
+
+			if (newpage != page) {
+				unlock_page(newpage);
+				if (is_zone_device_page(newpage))
+					put_page(newpage);
+				else
+					putback_lru_page(newpage);
+			}
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+/*
+ * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @pfns: array of hmm_pfn_t first containing source pfns then destination
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This will try to migrate a range of memory using callbacks to allocate and
+ * copy memory from source to destination. This function will first collect,
+ * lock and unmap pages in the range and then call the alloc_and_copy()
+ * callback for the device driver to allocate destination memory and copy.
+ *
+ * Then it will proceed and try to effectively migrate the pages (struct page
+ * metadata), a step that can fail for various reasons. Before updating the
+ * CPU page table it will call the finalize_and_map() callback so that the
+ * device driver can inspect what has been successfully migrated and update
+ * its own page table (this latter aspect is not mandatory and only makes
+ * sense for some users of this API).
+ *
+ * Finally the function updates the CPU page table and unlocks the pages
+ * before returning 0.
+ *
+ * It will return an error code only if one of the arguments is invalid.
+ */
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    unsigned long start,
+		    unsigned long end,
+		    hmm_pfn_t *pfns,
+		    void *private)
+{
+	struct hmm_migrate migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (!vma || !ops || !pfns || start >= end)
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	migrate.start = start;
+	migrate.pfns = pfns;
+	migrate.npages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	hmm_migrate_collect(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Lock and isolate page */
+	hmm_migrate_lock_and_isolate(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Unmap pages */
+	hmm_migrate_unmap(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped and thus they have stable
+	 * content and can safely be copied to destination memory that is
+	 * allocated by the callback.
+	 *
+	 * Note that migration can fail in hmm_migrate_struct_page() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, start, end, pfns, private);
+
+	/* This does the real migration of struct page */
+	hmm_migrate_struct_page(&migrate);
+
+	ops->finalize_and_map(vma, start, end, pfns, private);
+
+	/* Unlock and remap pages */
+	hmm_migrate_remove_migration_pte(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_migrate);
+#endif /* CONFIG_HMM */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 17/18] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (15 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-18 18:18 ` [HMM v13 18/18] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device
drivers to use when hotplugging un-addressable device memory as
ZONE_DEVICE. It finds an unused physical address range and triggers
memory hotplug for it, which allocates and initializes struct pages
for the device memory.
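
Usage from a device driver is expected to look roughly like the sketch
below; everything named example_* is made up and dev stands for the driver's
struct device, only the hmm_devmem_* API comes from this patch (the
hmm_devmem struct is wrapped inside a private driver struct as suggested in
the code comments):

  static void example_devmem_free(struct hmm_devmem *devmem, struct page *page)
  {
          /* return the page to the driver's device memory allocator */
  }

  static int example_devmem_fault(struct hmm_devmem *devmem,
                                  struct vm_area_struct *vma,
                                  unsigned long addr,
                                  struct page *page,
                                  unsigned flags,
                                  pmd_t *pmdp)
  {
          /* migrate the page back to system memory, then return 0 */
          return 0;
  }

  static const struct hmm_devmem_ops example_devmem_ops = {
          .free   = example_devmem_free,
          .fault  = example_devmem_fault,
  };

  /* at probe time, for 'size' bytes of on-device memory */
  ret = hmm_devmem_add(&example->devmem, &example_devmem_ops, dev, size);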

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 113 ++++++++++++++++++++++++
 mm/hmm.c            | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 360 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 9777309..ac0b69a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,6 +88,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -327,6 +331,9 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
 		   unsigned long end,
 		   hmm_pfn_t *pfns);
 
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
 
 /*
  * struct hmm_migrate_ops - migrate operation callback
@@ -371,6 +378,112 @@ int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
 		    void *private);
 
 
+struct hmm_devmem;
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the page refcount reaches 1 and it is no longer in use
+ * @fault: called when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ * @inuse: is struct in use
+ *
+ * This is a helper structure for device drivers that do not wish to implement
+ * the gory details related to hotplugging new memory and allocating struct
+ * pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct dev_pagemap		*pagemap;
+	struct device			*device;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+	bool				inuse;
+};
+
+/*
+ * To add (hotplug) device memory, it assumes that there is no real resource
+ * that reserves a range in the physical address space (this is intended to
+ * be used for un-addressable device memory). It will reserve a physical range
+ * big enough and allocate struct pages for it.
+ *
+ * Device drivers can wrap the hmm_devmem struct inside a private device
+ * driver struct. The driver must call hmm_devmem_remove() before the device
+ * goes away and before freeing the hmm_devmem struct memory.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size);
+bool hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   hmm_pfn_t *pfns,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page cannot be on the LRU we have an unsigned long that the
+ * driver can use to store a per-page field. This is a simple helper to do so.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 521adfd..f2ca895 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/mmu_notifier.h>
 
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
 
 /*
  * struct hmm - HMM per mm struct
@@ -831,3 +836,245 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
 	return hmm_vma_walk(vma, start, end, pfns, true);
 }
 EXPORT_SYMBOL(hmm_vma_fault);
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+	devmem->inuse = false;
+}
+
+static void hmm_devmem_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, hmm_devmem_exit, data);
+}
+
+static void hmm_devmem_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	devmem->inuse = false;
+	percpu_ref_kill(ref);
+	devm_remove_action(devmem->device, hmm_devmem_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+/*
+ * hmm_devmem_add() - hotplug fake ZONE_DEVICE memory for device memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: 0 on success, error code otherwise
+ *
+ * This first finds an empty range of physical addresses big enough for the
+ * new resource and then hotplugs it as ZONE_DEVICE memory, allocating struct
+ * pages. It does not do anything besides that; all events affecting the
+ * memory go through the various callbacks provided by the hmm_devmem_ops.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size)
+{
+	const struct resource *res;
+	resource_size_t addr;
+	void *ptr;
+	int ret;
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->pagemap = NULL;
+	devmem->inuse = false;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref,&hmm_devmem_release,0,GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	ret = devm_add_action(device, hmm_devmem_exit, &devmem->ref);
+	if (ret)
+		goto error;
+
+	size = ALIGN(size, SECTION_SIZE);
+	addr = (1UL << MAX_PHYSMEM_BITS) - size;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error;
+		}
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error;
+	}
+
+	ptr = devm_memremap_pages(device, devmem->resource, &devmem->ref,
+				  NULL, &devmem->pagemap,
+				  MEMORY_DEVICE | MEMORY_MOVABLE |
+				  MEMORY_UNADDRESSABLE);
+	if (IS_ERR(ptr)) {
+		ret = PTR_ERR(ptr);
+		goto error;
+	}
+
+	ret = devm_add_action(device, hmm_devmem_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_kill(&devmem->ref);
+		goto error;
+	}
+
+	res = devmem->pagemap->res;
+	devmem->pfn_first = res->start >> PAGE_SHIFT;
+	devmem->pfn_last = (resource_size(res)>>PAGE_SHIFT)+devmem->pfn_first;
+	devmem->pagemap->free_devpage = hmm_devmem_free;
+	devmem->pagemap->fault = hmm_devmem_fault;
+	devmem->pagemap->data = devmem;
+	devmem->inuse = true;
+
+	return 0;
+
+error:
+	hmm_devmem_exit(&devmem->ref);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * Returns: true if device memory is no longer in use, false if still in use
+ *
+ * This will hot remove memory that was hotplugged by hmm_devmem_add() on
+ * behalf of the device driver. It will free struct pages and remove the
+ * resource that reserves the physical address range for this device memory.
+ *
+ * The device driver cannot free the struct while this function returns false;
+ * it must call this function over and over until it returns true. Note that
+ * if there is a refcount bug this might never happen!
+ */
+bool hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	struct device *device = devmem->device;
+
+	hmm_devmem_kill(&devmem->ref);
+
+	if (devmem->pagemap) {
+		devm_memremap_pages_remove(device, devmem->pagemap);
+		devmem->pagemap = NULL;
+	}
+
+	hmm_devmem_exit(&devmem->ref);
+
+	/* FIXME maybe wait a bit ? */
+	if (devmem->inuse)
+		return false;
+
+	if (devmem->resource) {
+		resource_size_t size = resource_size(devmem->resource);
+
+		devm_release_mem_region(device, devmem->resource->start, size);
+		devmem->resource = NULL;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @pfns: array of hmm_pfn_t first containing source pfns then destination
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around hmm_vma_migrate() which checks the migration
+ * status for a given fault address and returns the corresponding page fault
+ * handler status, ie 0 on success or VM_FAULT_SIGBUS if migration failed.
+ *
+ * This is a helper intended to be used by a ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   hmm_pfn_t *pfns,
+			   void *private)
+{
+	if (hmm_vma_migrate(ops, vma, start, end, pfns, private))
+		return VM_FAULT_SIGBUS;
+
+	if (pfns[(addr - start) >> PAGE_SHIFT] & HMM_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [HMM v13 18/18] mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (16 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 17/18] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
@ 2016-11-18 18:18 ` Jérôme Glisse
  2016-11-19  0:41 ` [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 John Hubbard
  2016-11-23  9:16 ` Haggai Eran
  19 siblings, 0 replies; 73+ messages in thread
From: Jérôme Glisse @ 2016-11-18 18:18 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so a device driver can use it to
create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage multiple physical
device memory regions under the same device umbrella.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 ++++++++++++-
 mm/hmm.c            | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ac0b69a..106de1f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,10 +88,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -484,6 +484,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 }
 
 
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned		minor;
+};
+
+/*
+ * A device driver that wants to handle multiple device memory regions through
+ * a single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void);
+void hmm_device_put(struct hmm_device *hmm_device);
+
+
 /* Below are for HMM internal use only ! Not to be use by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/hmm.c b/mm/hmm.c
index f2ca895..61c3640187 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
@@ -1078,3 +1079,97 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * A device driver that wants to handle multiple device memory regions through
+ * a single fake device can use hmm_device to do so. This is purely a helper
+ * and it is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void)
+{
+	struct hmm_device *hmm_device;
+	int ret;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	ret = alloc_chrdev_region(&hmm_device->device.devt, 0, 1, "hmm_device");
+	if (ret < 0) {
+		kfree(hmm_device);
+		return ERR_PTR(ret);
+	}
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor=find_first_zero_bit(hmm_device_mask,HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+	unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+	class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
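
A rough usage sketch for a driver that wants a single umbrella device for
several physical memory regions (foo_* names are hypothetical, and the
hmm_devmem_add() calls are elided since their arguments are driver specific):

static int foo_register_device_memory(struct foo_driver *foo)
{
	foo->hmm_device = hmm_device_new();
	if (IS_ERR(foo->hmm_device))
		return PTR_ERR(foo->hmm_device);

	/* ... call hmm_devmem_add() for each physical region, hanging it
	 * off &foo->hmm_device->device ... */
	return 0;
}

static void foo_unregister_device_memory(struct foo_driver *foo)
{
	/* ... hmm_devmem_remove() each region first ... */
	hmm_device_put(foo->hmm_device);
}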
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
@ 2016-11-18 19:57   ` Aneesh Kumar K.V
  2016-11-18 20:15     ` Jerome Glisse
  2016-11-19 14:32   ` Aneesh Kumar K.V
  2016-11-21  3:30   ` Balbir Singh
  2 siblings, 1 reply; 73+ messages in thread
From: Aneesh Kumar K.V @ 2016-11-18 19:57 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

Jérôme Glisse <jglisse@redhat.com> writes:

> This patch add a new memory migration helpers, which migrate memory
> backing a range of virtual address of a process to different memory
> (which can be allocated through special allocator). It differs from
> numa migration by working on a range of virtual address and thus by
> doing migration in chunk that can be large enough to use DMA engine
> or special copy offloading engine.
>
> Expected users are any one with heterogeneous memory where different
> memory have different characteristics (latency, bandwidth, ...). As
> an example IBM platform with CAPI bus can make use of this feature
> to migrate between regular memory and CAPI device memory. New CPU
> architecture with a pool of high performance memory not manage as
> cache but presented as regular memory (while being faster and with
> lower latency than DDR) will also be prime user of this patch.
>
> Migration to private device memory will be usefull for device that
> have large pool of such like GPU, NVidia plans to use HMM for that.
>



..............


>+
> +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> +				unsigned long start,
> +				unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	struct hmm_migrate *migrate = walk->private;
> +	struct mm_struct *mm = walk->vma->vm_mm;
> +	unsigned long addr = start;
> +	spinlock_t *ptl;
> +	hmm_pfn_t *pfns;
> +	int pages = 0;
> +	pte_t *ptep;
> +
> +again:
> +	if (pmd_none(*pmdp))
> +		return 0;
> +
> +	split_huge_pmd(walk->vma, pmdp, addr);
> +	if (pmd_trans_unstable(pmdp))
> +		goto again;
> +
> +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +
> +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> +		unsigned long pfn;
> +		swp_entry_t entry;
> +		struct page *page;
> +		hmm_pfn_t flags;
> +		bool write;
> +		pte_t pte;
> +
> +		pte = ptep_get_and_clear(mm, addr, ptep);
> +		if (!pte_present(pte)) {
> +			if (pte_none(pte))
> +				continue;
> +
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_device_entry(entry)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				continue;
> +			}
> +
> +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
> +			page = device_entry_to_page(entry);
> +			write = is_write_device_entry(entry);
> +			pfn = page_to_pfn(page);
> +
> +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				continue;
> +			}
> +
> +		} else {
> +			pfn = pte_pfn(pte);
> +			page = pfn_to_page(pfn);
> +			write = pte_write(pte);
> +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
> +		}
> +
> +		/* FIXME support THP see hmm_migrate_page_check() */
> +		if (PageTransCompound(page))
> +			continue;
> +
> +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> +		*pfns |= write ? HMM_PFN_WRITE : 0;
> +		migrate->npages++;
> +		get_page(page);
> +
> +		if (!trylock_page(page)) {
> +			set_pte_at(mm, addr, ptep, pte);
> +		} else {
> +			pte_t swp_pte;
> +
> +			*pfns |= HMM_PFN_LOCKED;
> +
> +			entry = make_migration_entry(page, write);
> +			swp_pte = swp_entry_to_pte(entry);
> +			if (pte_soft_dirty(pte))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			set_pte_at(mm, addr, ptep, swp_pte);
> +
> +			page_remove_rmap(page, false);
> +			put_page(page);
> +			pages++;
> +		}

Can you explain this? What does a failure to lock mean here? Also, why
convert the pte to migration entries here? We do that in try_to_unmap(), right?


> +	}
> +
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
> +	/* Only flush the TLB if we actually modified any entries */
> +	if (pages)
> +		flush_tlb_range(walk->vma, start, end);
> +
> +	return 0;
> +}
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-18 19:57   ` Aneesh Kumar K.V
@ 2016-11-18 20:15     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-18 20:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Sat, Nov 19, 2016 at 01:27:28AM +0530, Aneesh Kumar K.V wrote:
> Jérôme Glisse <jglisse@redhat.com> writes:
>  
>
> [...]
>
> >+
> > +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> > +				unsigned long start,
> > +				unsigned long end,
> > +				struct mm_walk *walk)
> > +{
> > +	struct hmm_migrate *migrate = walk->private;
> > +	struct mm_struct *mm = walk->vma->vm_mm;
> > +	unsigned long addr = start;
> > +	spinlock_t *ptl;
> > +	hmm_pfn_t *pfns;
> > +	int pages = 0;
> > +	pte_t *ptep;
> > +
> > +again:
> > +	if (pmd_none(*pmdp))
> > +		return 0;
> > +
> > +	split_huge_pmd(walk->vma, pmdp, addr);
> > +	if (pmd_trans_unstable(pmdp))
> > +		goto again;
> > +
> > +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> > +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +
> > +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> > +		unsigned long pfn;
> > +		swp_entry_t entry;
> > +		struct page *page;
> > +		hmm_pfn_t flags;
> > +		bool write;
> > +		pte_t pte;
> > +
> > +		pte = ptep_get_and_clear(mm, addr, ptep);
> > +		if (!pte_present(pte)) {
> > +			if (pte_none(pte))
> > +				continue;
> > +
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_device_entry(entry)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> > +				continue;
> > +			}
> > +
> > +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
> > +			page = device_entry_to_page(entry);
> > +			write = is_write_device_entry(entry);
> > +			pfn = page_to_pfn(page);
> > +
> > +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> > +				continue;
> > +			}
> > +
> > +		} else {
> > +			pfn = pte_pfn(pte);
> > +			page = pfn_to_page(pfn);
> > +			write = pte_write(pte);
> > +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
> > +		}
> > +
> > +		/* FIXME support THP see hmm_migrate_page_check() */
> > +		if (PageTransCompound(page))
> > +			continue;
> > +
> > +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> > +		*pfns |= write ? HMM_PFN_WRITE : 0;
> > +		migrate->npages++;
> > +		get_page(page);
> > +
> > +		if (!trylock_page(page)) {
> > +			set_pte_at(mm, addr, ptep, pte);
> > +		} else {
> > +			pte_t swp_pte;
> > +
> > +			*pfns |= HMM_PFN_LOCKED;
> > +
> > +			entry = make_migration_entry(page, write);
> > +			swp_pte = swp_entry_to_pte(entry);
> > +			if (pte_soft_dirty(pte))
> > +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +			set_pte_at(mm, addr, ptep, swp_pte);
> > +
> > +			page_remove_rmap(page, false);
> > +			put_page(page);
> > +			pages++;
> > +		}
> 
> Can you explain this? What does a failure to lock mean here? Also, why
> convert the pte to migration entries here? We do that in try_to_unmap(), right?

This is an optimization for the usual case where the memory is only used in
one process and no concurrent migration/memory event is happening. Basically,
if we can lock the page without waiting then we unmap it right away and the
later call to try_to_unmap() is a no-op.

This is purely to optimize this common case. In short, it is doing the
try_to_unmap() work ahead of time.
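
As a rough sketch (purely illustrative, reusing the variables from the hunk
quoted above and eliding the soft-dirty handling), the two paths are:

	if (trylock_page(page)) {
		/* uncontended common case: install the migration entry now,
		 * so the later try_to_unmap() pass finds nothing left to do */
		*pfns |= HMM_PFN_LOCKED;
		set_pte_at(mm, addr, ptep,
			   swp_entry_to_pte(make_migration_entry(page, write)));
		page_remove_rmap(page, false);
		put_page(page);
	} else {
		/* page lock is contended: put the original pte back and let
		 * the regular try_to_unmap() path handle this page later */
		set_pte_at(mm, addr, ptep, pte);
	}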

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (17 preceding siblings ...)
  2016-11-18 18:18 ` [HMM v13 18/18] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
@ 2016-11-19  0:41 ` John Hubbard
  2016-11-19 14:50   ` Aneesh Kumar K.V
  2016-11-23  9:16 ` Haggai Eran
  19 siblings, 1 reply; 73+ messages in thread
From: John Hubbard @ 2016-11-19  0:41 UTC (permalink / raw)
  To: Jérôme Glisse; +Cc: akpm, linux-kernel, linux-mm


On Fri, 18 Nov 2016, Jérôme Glisse wrote:

> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows to use device memory transparently inside any process
> without any modifications to process program code. Second it allows
> to mirror process address space on a device.
> 
> Change since v12 is the use of struct page for device memory even if
> the device memory is not accessible by the CPU (because of limitation
> impose by the bus between the CPU and the device).
> 
> Using struct page means that their are minimal changes to core mm
> code. HMM build on top of ZONE_DEVICE to provide struct page, it
> adds new features to ZONE_DEVICE. The first 7 patches implement
> those changes.
> 
> Rest of patchset is divided into 3 features that can each be use
> independently from one another. First is the process address space
> mirroring (patch 9 to 13), this allow to snapshot CPU page table
> and to keep the device page table synchronize with the CPU one.
> 
> Second is a new memory migration helper which allow migration of
> a range of virtual address of a process. This memory migration
> also allow device to use their own DMA engine to perform the copy
> between the source memory and destination memory. This can be
> usefull even outside HMM context in many usecase.
> 
> Third part of the patchset (patch 17-18) is a set of helper to
> register a ZONE_DEVICE node and manage it. It is meant as a
> convenient helper so that device drivers do not each have to
> reimplement over and over the same boiler plate code.
> 
> 
> I am hoping that this can now be consider for inclusion upstream.
> Bottom line is that without HMM we can not support some of the new
> hardware features on x86 PCIE. I do believe we need some solution
> to support those features or we won't be able to use such hardware
> in standard like C++17, OpenCL 3.0 and others.
> 
> I have been working with NVidia to bring up this feature on their
> Pascal GPU. There are real hardware that you can buy today that
> could benefit from HMM. We also intend to leverage this inside the
> open source nouveau driver.
> 

Hi,

We (NVIDIA engineering) have been working closely with Jerome on this for 
several years now, and I wanted to mention that NVIDIA is committed to 
using HMM. We've done initial testing of this patchset on Pascal GPUs (a 
bit more detail below) and it is looking good.
  
The HMM features are a prerequisite to an important part of NVIDIA's 
efforts to make writing code for GPUs (and other page-faulting devices) 
easier--by making it more like writing code for CPUs. A big part of that 
story involves being able to use malloc'd memory transparently everywhere. 
Here's a tiny example (in case it's not obvious from the HMM patchset 
documentation) of HMM in action:

        int *p = (int*)malloc(SIZE); *p = 5; /* on the CPU */

        x = *p;   /* on a GPU, or on any page-fault-capable device */
   
1. A device page fault occurs because the malloc'd memory was never 
allocated in the device's page tables.
  
2. The device driver receives a page fault interrupt, but fails to 
recognize the address, so it calls into HMM.

3. HMM knows that p is valid on the CPU, and coordinates with the device 
driver to unmap the CPU page, allocate a page on the device, and then 
migrate (copy) the data to the device. This allows full device memory 
bandwidth to be available, which is critical to getting good performance.

    a) Alternatively, leave the page on the CPU, and create a device 
PTE to point to that page. This might be done if our performance counters 
show that a page is thrashing.
   
4. The device driver issues a replay-page-fault to the device.
 
5. The device program continues running, and x == 5 now.
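
Putting those steps together from the application's point of view, the flow
looks roughly like this (gpu_launch() and sum_on_gpu below are stand-ins for
whatever API the device runtime exposes, not real functions):

        int *p = malloc(N * sizeof(*p));   /* ordinary CPU allocation */
        for (int i = 0; i < N; i++)
                p[i] = i;                  /* pages live in system memory */

        /* First GPU access faults; HMM migrates (or maps) the pages, the
         * fault is replayed, and the kernel runs to completion. */
        gpu_launch(sum_on_gpu, p, N);

        /* No explicit copy-back call is needed: if the data was migrated to
         * device memory, a later CPU access simply faults it back. */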

When version 1 of this patchset was created (2.5 years ago! in May, 2014), 
one huge concern was that we didn't yet have hardware that could use it.  
But now we do: Pascal GPUs, which have been shipping this year, all 
support replayable page faults.

Testing:

We have done some testing of this latest patchset on Pascal GPUs using our 
nvidia-uvm.ko module (which is open source, separate from the closed 
source nvidia.ko). There is still much more testing to do, of course, but 
basic page mirroring and page migration (between CPU and GPU), and even 
some multi-GPU cases, are all working.

We do think we've found a bug in a corner case that involves invalid GPU 
memory (of course, it's always possible that the bug is on our side), 
which Jerome is investigating now. If you spot the bug by inspection, 
you'll get some major told-you-so points. :)

The performance is looking good on the testing we’ve done so far, too.

thanks,

John Hubbard
NVIDIA Systems Software Engineer

> 
> In this patchset i restricted myself to set of core features what
> is missing:
>   - force read only on CPU for memory duplication and GPU atomic
>   - changes to mmu_notifier for optimization purposes
>   - migration of file back page to device memory
> 
> I plan to submit a couple more patchset to implement those feature
> once core HMM is upstream.
> 
> 
> Is there anything blocking HMM inclusion ? Something fundamental ?
> 
> 
> Previous patchset posting :
>     v1 http://lwn.net/Articles/597289/
>     v2 https://lkml.org/lkml/2014/6/12/559
>     v3 https://lkml.org/lkml/2014/6/13/633
>     v4 https://lkml.org/lkml/2014/8/29/423
>     v5 https://lkml.org/lkml/2014/11/3/759
>     v6 http://lwn.net/Articles/619737/
>     v7 http://lwn.net/Articles/627316/
>     v8 https://lwn.net/Articles/645515/
>     v9 https://lwn.net/Articles/651553/
>     v10 https://lwn.net/Articles/654430/
>     v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
>     v12 http://www.kernelhub.org/?msg=972982&p=2
> 
> Cheers,
> Jérôme
> 
> Jérôme Glisse (18):
>   mm/memory/hotplug: convert device parameter bool to set of flags
>   mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
>     memory
>   mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
>   mm/ZONE_DEVICE/free-page: callback when page is freed
>   mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device
>     memory
>   mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
>   mm/ZONE_DEVICE/x86: add support for un-addressable device memory
>   mm/hmm: heterogeneous memory management (HMM for short)
>   mm/hmm/mirror: mirror process address space on device with HMM helpers
>   mm/hmm/mirror: add range lock helper, prevent CPU page table update
>     for the range
>   mm/hmm/mirror: add range monitor helper, to monitor CPU page table
>     update
>   mm/hmm/mirror: helper to snapshot CPU page table
>   mm/hmm/mirror: device page fault handler
>   mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
>   mm/hmm/migrate: add new boolean copy flag to migratepage() callback
>   mm/hmm/migrate: new memory migration helper for use with device memory
>   mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
>   mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory
> 
>  MAINTAINERS                                |    7 +
>  arch/ia64/mm/init.c                        |   19 +-
>  arch/powerpc/mm/mem.c                      |   18 +-
>  arch/s390/mm/init.c                        |   10 +-
>  arch/sh/mm/init.c                          |   18 +-
>  arch/tile/mm/init.c                        |   10 +-
>  arch/x86/mm/init_32.c                      |   19 +-
>  arch/x86/mm/init_64.c                      |   23 +-
>  drivers/dax/pmem.c                         |    3 +-
>  drivers/nvdimm/pmem.c                      |    5 +-
>  drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
>  fs/aio.c                                   |    7 +-
>  fs/btrfs/disk-io.c                         |   11 +-
>  fs/hugetlbfs/inode.c                       |    9 +-
>  fs/nfs/internal.h                          |    5 +-
>  fs/nfs/write.c                             |    9 +-
>  fs/proc/task_mmu.c                         |   10 +-
>  fs/ubifs/file.c                            |    8 +-
>  include/linux/balloon_compaction.h         |    3 +-
>  include/linux/fs.h                         |   13 +-
>  include/linux/hmm.h                        |  516 ++++++++++++
>  include/linux/memory_hotplug.h             |   17 +-
>  include/linux/memremap.h                   |   39 +-
>  include/linux/migrate.h                    |    7 +-
>  include/linux/mm_types.h                   |    5 +
>  include/linux/swap.h                       |   18 +-
>  include/linux/swapops.h                    |   67 ++
>  kernel/fork.c                              |    2 +
>  kernel/memremap.c                          |   48 +-
>  mm/Kconfig                                 |   23 +
>  mm/Makefile                                |    1 +
>  mm/balloon_compaction.c                    |    2 +-
>  mm/hmm.c                                   | 1175 ++++++++++++++++++++++++++++
>  mm/memory.c                                |   33 +
>  mm/memory_hotplug.c                        |    4 +-
>  mm/migrate.c                               |  651 ++++++++++++++-
>  mm/mprotect.c                              |   12 +
>  mm/page_alloc.c                            |   10 +
>  mm/rmap.c                                  |   47 ++
>  tools/testing/nvdimm/test/iomap.c          |    2 +-
>  40 files changed, 2811 insertions(+), 83 deletions(-)
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
> 
> -- 
> 2.4.3
> 
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
  2016-11-18 19:57   ` Aneesh Kumar K.V
@ 2016-11-19 14:32   ` Aneesh Kumar K.V
  2016-11-19 17:17     ` Jerome Glisse
  2016-11-21  3:30   ` Balbir Singh
  2 siblings, 1 reply; 73+ messages in thread
From: Aneesh Kumar K.V @ 2016-11-19 14:32 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

Jérôme Glisse <jglisse@redhat.com> writes:

> This patch add a new memory migration helpers, which migrate memory
> backing a range of virtual address of a process to different memory
> (which can be allocated through special allocator). It differs from
> numa migration by working on a range of virtual address and thus by
> doing migration in chunk that can be large enough to use DMA engine
> or special copy offloading engine.
>
> Expected users are any one with heterogeneous memory where different
> memory have different characteristics (latency, bandwidth, ...). As
> an example IBM platform with CAPI bus can make use of this feature
> to migrate between regular memory and CAPI device memory. New CPU
> architecture with a pool of high performance memory not manage as
> cache but presented as regular memory (while being faster and with
> lower latency than DDR) will also be prime user of this patch.
>
> Migration to private device memory will be usefull for device that
> have large pool of such like GPU, NVidia plans to use HMM for that.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  include/linux/hmm.h |  54 ++++-
>  mm/migrate.c        | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 635 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index c79abfc..9777309 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -101,10 +101,13 @@ struct hmm;
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
>   * HMM_PFN_FAULT: use by hmm_vma_fault() to signify which address need faulting
>   * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
> + * HMM_PFN_LOCKED: underlying struct page is lock
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
>   *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
>   *      device (the entry will never have HMM_PFN_VALID set and the pfn value
>   *      is undefine)
> + * HMM_PFN_MIGRATE: use by hmm_vma_migrate() to signify which address can be
> + *      migrated
>   * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
>   */
>  typedef unsigned long hmm_pfn_t;
> @@ -116,9 +119,11 @@ typedef unsigned long hmm_pfn_t;
>  #define HMM_PFN_EMPTY (1 << 4)
>  #define HMM_PFN_FAULT (1 << 5)
>  #define HMM_PFN_DEVICE (1 << 6)
> -#define HMM_PFN_SPECIAL (1 << 7)
> -#define HMM_PFN_UNADDRESSABLE (1 << 8)
> -#define HMM_PFN_SHIFT 9
> +#define HMM_PFN_LOCKED (1 << 7)
> +#define HMM_PFN_SPECIAL (1 << 8)
> +#define HMM_PFN_MIGRATE (1 << 9)
> +#define HMM_PFN_UNADDRESSABLE (1 << 10)
> +#define HMM_PFN_SHIFT 11
>  
>  static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
>  {
> @@ -323,6 +328,49 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
>  		   hmm_pfn_t *pfns);
>  
>  
> +/*
> + * struct hmm_migrate_ops - migrate operation callback
> + *
> + * @alloc_and_copy: alloc destination memoiry and copy source to it
> + * @finalize_and_map: allow caller to inspect successfull migrated page
> + *
> + * The new HMM migrate helper hmm_vma_migrate() allow memory migration to use
> + * device DMA engine to perform copy from source to destination memory it also
> + * allow caller to use its own memory allocator for destination memory.
> + *
> + * Note that in alloc_and_copy device driver can decide not to migrate some of
> + * the entry, for those it must clear the HMM_PFN_MIGRATE flag. The destination
> + * page must lock and the corresponding hmm_pfn_t value in the array updated
> + * with the HMM_PFN_MIGRATE and HMM_PFN_LOCKED flag set (and of course be a
> + * valid entry). It is expected that the page allocated will have an elevated
> + * refcount and that a put_page() will free the page. Device driver might want
> + * to allocate with an extra-refcount if they want to control deallocation of
> + * failed migration inside the finalize_and_map() callback.
> + *
> + * Inside finalize_and_map() device driver must use the HMM_PFN_MIGRATE flag to
> + * determine which page have been successfully migrated.
> + */
> +struct hmm_migrate_ops {
> +	void (*alloc_and_copy)(struct vm_area_struct *vma,
> +			       unsigned long start,
> +			       unsigned long end,
> +			       hmm_pfn_t *pfns,
> +			       void *private);
> +	void (*finalize_and_map)(struct vm_area_struct *vma,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 hmm_pfn_t *pfns,
> +				 void *private);
> +};
> +
> +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> +		    struct vm_area_struct *vma,
> +		    unsigned long start,
> +		    unsigned long end,
> +		    hmm_pfn_t *pfns,
> +		    void *private);
> +
> +
>  /* Below are for HMM internal use only ! Not to be use by device driver ! */
>  void hmm_mm_destroy(struct mm_struct *mm);
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9ce8db..393d592 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -41,6 +41,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/page_owner.h>
>  #include <linux/memremap.h>
> +#include <linux/hmm.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	int expected_count = 1 + extra_count;
>  	void **pslot;
>  
> +	/*
> +	 * ZONE_DEVICE pages have 1 refcount always held by their device
> +	 *
> +	 * Note that DAX memory will never reach that point as it does not have
> +	 * the MEMORY_MOVABLE flag set (see include/linux/memory_hotplug.h).
> +	 */
> +	expected_count += is_zone_device_page(page);
> +
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
>  		if (page_count(page) != expected_count)
> @@ -2087,3 +2096,578 @@ out_unlock:
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  #endif /* CONFIG_NUMA */
> +
> +
> +#if defined(CONFIG_HMM)
> +struct hmm_migrate {
> +	struct vm_area_struct	*vma;
> +	unsigned long		start;
> +	unsigned long		end;
> +	unsigned long		npages;
> +	hmm_pfn_t		*pfns;
> +};
> +
> +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> +				unsigned long start,
> +				unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	struct hmm_migrate *migrate = walk->private;
> +	struct mm_struct *mm = walk->vma->vm_mm;
> +	unsigned long addr = start;
> +	spinlock_t *ptl;
> +	hmm_pfn_t *pfns;
> +	int pages = 0;
> +	pte_t *ptep;
> +
> +again:
> +	if (pmd_none(*pmdp))
> +		return 0;
> +
> +	split_huge_pmd(walk->vma, pmdp, addr);
> +	if (pmd_trans_unstable(pmdp))
> +		goto again;
> +
> +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +
> +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> +		unsigned long pfn;
> +		swp_entry_t entry;
> +		struct page *page;
> +		hmm_pfn_t flags;
> +		bool write;
> +		pte_t pte;
> +
> +		pte = ptep_get_and_clear(mm, addr, ptep);
> +		if (!pte_present(pte)) {
> +			if (pte_none(pte))
> +				continue;
> +
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_device_entry(entry)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				continue;
> +			}
> +
> +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
> +			page = device_entry_to_page(entry);
> +			write = is_write_device_entry(entry);
> +			pfn = page_to_pfn(page);
> +
> +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				continue;
> +			}
> +
> +		} else {
> +			pfn = pte_pfn(pte);
> +			page = pfn_to_page(pfn);
> +			write = pte_write(pte);
> +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;

Will that is_zone_device_page() ever be true? The pte is present in
the else path; can the struct page backing it come from ZONE_DEVICE?


> +		}
> +
> +		/* FIXME support THP see hmm_migrate_page_check() */
> +		if (PageTransCompound(page))
> +			continue;

What about page cache pages? Do we support those? If not, maybe skip
them here?



> +
> +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> +		*pfns |= write ? HMM_PFN_WRITE : 0;
> +		migrate->npages++;
> +		get_page(page);
> +
> +		if (!trylock_page(page)) {
> +			set_pte_at(mm, addr, ptep, pte);
> +		} else {
> +			pte_t swp_pte;
> +
> +			*pfns |= HMM_PFN_LOCKED;
> +
> +			entry = make_migration_entry(page, write);
> +			swp_pte = swp_entry_to_pte(entry);
> +			if (pte_soft_dirty(pte))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			set_pte_at(mm, addr, ptep, swp_pte);
> +
> +			page_remove_rmap(page, false);
> +			put_page(page);
> +			pages++;
> +		}

If this is an optimization, can we get it as a separate patch with
additional comments? How does taking the page lock successfully imply that
it is not a shared mapping?


> +	}
> +
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
> +	/* Only flush the TLB if we actually modified any entries */
> +	if (pages)
> +		flush_tlb_range(walk->vma, start, end);
> +
> +	return 0;
> +}


So without the optimization the above function is supposed to raise the
refcount and collect, in the array, all possible pfns that we can migrate?


> +
> +static void hmm_migrate_collect(struct hmm_migrate *migrate)
> +{
> +	struct mm_walk mm_walk;
> +
> +	mm_walk.pmd_entry = hmm_collect_walk_pmd;
> +	mm_walk.pte_entry = NULL;
> +	mm_walk.pte_hole = NULL;
> +	mm_walk.hugetlb_entry = NULL;
> +	mm_walk.test_walk = NULL;
> +	mm_walk.vma = migrate->vma;
> +	mm_walk.mm = migrate->vma->vm_mm;
> +	mm_walk.private = migrate;
> +
> +	mmu_notifier_invalidate_range_start(mm_walk.mm,
> +					    migrate->start,
> +					    migrate->end);
> +	walk_page_range(migrate->start, migrate->end, &mm_walk);
> +	mmu_notifier_invalidate_range_end(mm_walk.mm,
> +					  migrate->start,
> +					  migrate->end);
> +}
> +
> +static inline bool hmm_migrate_page_check(struct page *page, int extra)
> +{
> +	/*
> +	 * FIXME support THP (transparent huge page), it is bit more complex to
> +	 * check them then regular page because they can be map with a pmd or
> +	 * with a pte (split pte mapping).
> +	 */
> +	if (PageCompound(page))
> +		return false;
> +
> +	if (is_zone_device_page(page))
> +		extra++;
> +
> +	if ((page_count(page) - extra) > page_mapcount(page))
> +		return false;
> +
> +	return true;
> +}
> +
> +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +	struct vm_area_struct *vma = migrate->vma;
> +	unsigned long restore = 0;
> +	bool allow_drain = true;
> +
> +	lru_add_drain();
> +
> +again:
> +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
> +			lock_page(page);
> +			migrate->pfns[i] |= HMM_PFN_LOCKED;
> +		}

What does taking a page_lock protect against ? Can we document that ?

> +
> +		/* ZONE_DEVICE page are not on LRU */
> +		if (is_zone_device_page(page))
> +			goto check;
> +
> +		if (!PageLRU(page) && allow_drain) {
> +			/* Drain CPU's pagevec so page can be isolated */
> +			lru_add_drain_all();
> +			allow_drain = false;
> +			goto again;
> +		}
> +
> +		if (isolate_lru_page(page)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			put_page(page);
> +			restore++;
> +		} else
> +			/* Drop the reference we took in collect */
> +			put_page(page);
> +
> +check:
> +		if (!hmm_migrate_page_check(page, 1)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			restore++;
> +		}
> +	}
> +
> +	if (!restore)
> +		return;
> +
> +	for (addr = migrate->start, i = 0; addr < migrate->end;) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +		unsigned long next, restart;
> +		spinlock_t *ptl;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE)) {
> +			addr += PAGE_SIZE;
> +			i++;
> +			continue;
> +		}
> +
> +		restart = addr;
> +		pgdp = pgd_offset(mm, addr);
> +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pudp = pud_offset(pgdp, addr);
> +		if (!pudp || pud_none(*pudp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> +			addr = next;
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			swp_entry_t entry;
> +			bool write;
> +			pte_t pte;
> +
> +			page = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +
> +			write = migrate->pfns[i] & HMM_PFN_WRITE;
> +			write &= (vma->vm_flags & VM_WRITE);
> +
> +			/* Here it means pte must be a valid migration entry */
> +			pte = ptep_get_and_clear(mm, addr, ptep);


It is already a non-present pte, so why use ptep_get_and_clear()? Why not
*ptep? Some archs do a lot of additional work in get_and_clear.

> +			if (pte_none(pte) || pte_present(pte))
> +				/* SOMETHING BAD IS GOING ON ! */
> +				continue;
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_migration_entry(entry))
> +				/* SOMETHING BAD IS GOING ON ! */
> +				continue;
> +
> +			if (is_zone_device_page(page) &&
> +			    !is_addressable_page(page)) {
> +				entry = make_device_entry(page, write);
> +				pte = swp_entry_to_pte(entry);
> +			} else {
> +				pte = mk_pte(page, vma->vm_page_prot);
> +				pte = pte_mkold(pte);
> +				if (write)
> +					pte = pte_mkwrite(pte);
> +			}
> +			if (pte_swp_soft_dirty(*ptep))
> +				pte = pte_mksoft_dirty(pte);
> +
> +			get_page(page);
> +			set_pte_at(mm, addr, ptep, pte);
> +			if (PageAnon(page))
> +				page_add_anon_rmap(page, vma, addr, false);
> +			else
> +				page_add_file_rmap(page, false);

Do we support page cache already? Or is this just a placeholder? If so,
maybe we can drop it and add it later when we add page cache
support?


> +		}
> +		pte_unmap_unlock(ptep - 1, ptl);
> +
> +		addr = restart;
> +		i = (addr - migrate->start) >> PAGE_SHIFT;
> +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
> +			page = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +
> +			migrate->pfns[i] = 0;
> +			unlock_page(page);
> +			restore--;
> +
> +			if (is_zone_device_page(page)) {
> +				put_page(page);
> +				continue;
> +			}
> +
> +			putback_lru_page(page);
> +		}
> +
> +		if (!restore)
> +			break;
> +	}


None of the above restore work would be needed if we didn't do that
migration entry setup in the first function, right? We would just need to
drop the refcount for the pages that we failed to isolate? No need to walk
the page table, etc.?

> +}
> +
> +static void hmm_migrate_unmap(struct hmm_migrate *migrate)
> +{
> +	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	unsigned long addr = migrate->start, i = 0, restore = 0;
> +
> +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> +			continue;
> +
> +		try_to_unmap(page, flags);
> +		if (page_mapped(page) || !hmm_migrate_page_check(page, 1)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			restore++;
> +		}
> +	}
> +
> +	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +			continue;
> +
> +		remove_migration_ptes(page, page, false);
> +
> +		migrate->pfns[i] = 0;
> +		unlock_page(page);
> +		restore--;
> +
> +		if (is_zone_device_page(page)) {
> +			put_page(page);
> +			continue;
> +		}
> +
> +		putback_lru_page(page);

Maybe:

     } else 
	putback_lru_page(page);

> +	}
> +}
> +
> +static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +
> +	for (; addr < migrate->end;) {
> +		unsigned long next;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		pgdp = pgd_offset(mm, addr);
> +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pudp = pud_offset(pgdp, addr);
> +		if (!pudp || pud_none(*pudp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> +			addr = next;
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +
> +		/* No need to lock nothing can change from under us */
> +		ptep = pte_offset_map(pmdp, addr);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			struct address_space *mapping;
> +			struct page *newpage, *page;
> +			swp_entry_t entry;
> +			int r;
> +
> +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!newpage || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +			if (pte_none(*ptep) || pte_present(*ptep)) {
> +				/* This should not happen but be nice */
> +				migrate->pfns[i] = 0;
> +				put_page(newpage);
> +				continue;
> +			}
> +			entry = pte_to_swp_entry(*ptep);
> +			if (!is_migration_entry(entry)) {
> +				/* This should not happen but be nice */
> +				migrate->pfns[i] = 0;
> +				put_page(newpage);
> +				continue;
> +			}
> +
> +			page = migration_entry_to_page(entry);
> +			mapping = page_mapping(page);
> +
> +			/*
> +			 * For now only support private anonymous when migrating
> +			 * to un-addressable device memory.
> +			 */
> +			if (mapping && is_zone_device_page(newpage) &&
> +			    !is_addressable_page(newpage)) {
> +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +				continue;
> +			}
> +
> +			r = migrate_page(mapping, newpage, page,
> +					 MIGRATE_SYNC, false);
> +			if (r != MIGRATEPAGE_SUCCESS)
> +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +		}
> +		pte_unmap(ptep - 1);
> +	}
> +}


Why are we walking the page table multiple times? Is it that after
alloc_and_copy the content of the migrate->pfns array now holds the new pfns?
It is confusing that each of these functions walks the same page table
again (even though that work could be shared). I was expecting us to
walk the page table once to collect the pfns/pages and then use that
in the rest of the calls. Any specific reason you chose to implement it
this way?


> +
> +static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +
> +	for (; addr < migrate->end;) {
> +		unsigned long next;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		pgdp = pgd_offset(mm, addr);
> +		pudp = pud_offset(pgdp, addr);
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +
> +		/* No need to lock nothing can change from under us */
> +		ptep = pte_offset_map(pmdp, addr);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			struct page *page, *newpage;
> +			swp_entry_t entry;
> +
> +			if (pte_none(*ptep) || pte_present(*ptep))
> +				continue;
> +			entry = pte_to_swp_entry(*ptep);
> +			if (!is_migration_entry(entry))
> +				continue;
> +
> +			page = migration_entry_to_page(entry);
> +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!newpage)
> +				newpage = page;
> +			remove_migration_ptes(page, newpage, false);
> +
> +			migrate->pfns[i] = 0;
> +			unlock_page(page);
> +			migrate->npages--;
> +
> +			if (is_zone_device_page(page))
> +				put_page(page);
> +			else
> +				putback_lru_page(page);
> +
> +			if (newpage != page) {
> +				unlock_page(newpage);
> +				if (is_zone_device_page(newpage))
> +					put_page(newpage);
> +				else
> +					putback_lru_page(newpage);
> +			}
> +		}
> +		pte_unmap(ptep - 1);
> +	}
> +}
> +
> +/*
> + * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
> + *
> + * @ops: migration callback for allocating destination memory and copying
> + * @vma: virtual memory area containing the range to be migrated
> + * @start: start address of the range to migrate (inclusive)
> + * @end: end address of the range to migrate (exclusive)
> + * @pfns: array of hmm_pfn_t first containing source pfns then destination
> + * @private: pointer passed back to each of the callback
> + * Returns: 0 on success, error code otherwise
> + *
> + * This will try to migrate a range of memory using callback to allocate and
> + * copy memory from source to destination. This function will first collect,
> + * lock and unmap pages in the range and then call alloc_and_copy() callback
> + * for device driver to allocate destination memory and copy from source.
> + *
> + * Then it will proceed and try to effectively migrate the page (struct page
> + * metadata) a step that can fail for various reasons. Before updating CPU page
> + * table it will call finalize_and_map() callback so that device driver can
> + * inspect what have been successfully migrated and update its own page table
> + * (this latter aspect is not mandatory and only make sense for some user of
> + * this API).
> + *
> + * Finaly the function update CPU page table and unlock the pages before
> + * returning 0.
> + *
> + * It will return an error code only if one of the argument is invalid.
> + */
> +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> +		    struct vm_area_struct *vma,
> +		    unsigned long start,
> +		    unsigned long end,
> +		    hmm_pfn_t *pfns,
> +		    void *private)
> +{
> +	struct hmm_migrate migrate;
> +
> +	/* Sanity check the arguments */
> +	start &= PAGE_MASK;
> +	end &= PAGE_MASK;
> +	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> +		return -EINVAL;
> +	if (!vma || !ops || !pfns || start >= end)
> +		return -EINVAL;
> +	if (start < vma->vm_start || start >= vma->vm_end)
> +		return -EINVAL;
> +	if (end <= vma->vm_start || end > vma->vm_end)
> +		return -EINVAL;
> +
> +	migrate.start = start;
> +	migrate.pfns = pfns;
> +	migrate.npages = 0;
> +	migrate.end = end;
> +	migrate.vma = vma;
> +
> +	/* Collect, and try to unmap source pages */
> +	hmm_migrate_collect(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/* Lock and isolate page */
> +	hmm_migrate_lock_and_isolate(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/* Unmap pages */
> +	hmm_migrate_unmap(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/*
> +	 * At this point pages are lock and unmap and thus they have stable
> +	 * content and can safely be copied to destination memory that is
> +	 * allocated by the callback.
> +	 *
> +	 * Note that migration can fail in hmm_migrate_struct_page() for each
> +	 * individual page.
> +	 */
> +	ops->alloc_and_copy(vma, start, end, pfns, private);
> +
> +	/* This does the real migration of struct page */
> +	hmm_migrate_struct_page(&migrate);
> +
> +	ops->finalize_and_map(vma, start, end, pfns, private);
> +
> +	/* Unlock and remap pages */
> +	hmm_migrate_remove_migration_pte(&migrate);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_vma_migrate);
> +#endif /* CONFIG_HMM */

IMHO, if we can get each of the above functions documented properly it will
help with code review. Also, if we can avoid those multiple page table
walks, it will bring this closer to the existing migration logic.

-aneesh

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
  2016-11-19  0:41 ` [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 John Hubbard
@ 2016-11-19 14:50   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 73+ messages in thread
From: Aneesh Kumar K.V @ 2016-11-19 14:50 UTC (permalink / raw)
  To: John Hubbard, Jérôme Glisse; +Cc: akpm, linux-kernel, linux-mm

John Hubbard <jhubbard@nvidia.com> writes:

> On Fri, 18 Nov 2016, Jérôme Glisse wrote:
>
>> Cliff note: HMM offers 2 things (each standing on its own). First
>> it allows to use device memory transparently inside any process
>> without any modifications to process program code. Second it allows
>> to mirror process address space on a device.
>> 
>> Change since v12 is the use of struct page for device memory even if
>> the device memory is not accessible by the CPU (because of limitation
>> impose by the bus between the CPU and the device).
>> 
>> Using struct page means that their are minimal changes to core mm
>> code. HMM build on top of ZONE_DEVICE to provide struct page, it
>> adds new features to ZONE_DEVICE. The first 7 patches implement
>> those changes.
>> 
>> Rest of patchset is divided into 3 features that can each be use
>> independently from one another. First is the process address space
>> mirroring (patch 9 to 13), this allow to snapshot CPU page table
>> and to keep the device page table synchronize with the CPU one.
>> 
>> Second is a new memory migration helper which allow migration of
>> a range of virtual address of a process. This memory migration
>> also allow device to use their own DMA engine to perform the copy
>> between the source memory and destination memory. This can be
>> usefull even outside HMM context in many usecase.
>> 
>> Third part of the patchset (patch 17-18) is a set of helper to
>> register a ZONE_DEVICE node and manage it. It is meant as a
>> convenient helper so that device drivers do not each have to
>> reimplement over and over the same boiler plate code.
>> 
>> 
>> I am hoping that this can now be consider for inclusion upstream.
>> Bottom line is that without HMM we can not support some of the new
>> hardware features on x86 PCIE. I do believe we need some solution
>> to support those features or we won't be able to use such hardware
>> in standard like C++17, OpenCL 3.0 and others.
>> 
>> I have been working with NVidia to bring up this feature on their
>> Pascal GPU. There are real hardware that you can buy today that
>> could benefit from HMM. We also intend to leverage this inside the
>> open source nouveau driver.
>> 
>
> Hi,
>
> We (NVIDIA engineering) have been working closely with Jerome on this for 
> several years now, and I wanted to mention that NVIDIA is committed to 
> using HMM. We've done initial testing of this patchset on Pascal GPUs (a 
> bit more detail below) and it is looking good.
>   

This can also be used on IBM platforms like Minsky (
http://www.tomshardware.com/news/ibm-power8-nvidia-tesla-p100-minsky,32661.html
)

There is also discussion around using this for device accelerated page
migration. That can help with coherent device memory node work.
(https://lkml.kernel.org/r/1477283517-2504-1-git-send-email-khandual@linux.vnet.ibm.com)

-aneesh

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-19 14:32   ` Aneesh Kumar K.V
@ 2016-11-19 17:17     ` Jerome Glisse
  2016-11-20 18:21       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-19 17:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Sat, Nov 19, 2016 at 08:02:26PM +0530, Aneesh Kumar K.V wrote:
> Jérôme Glisse <jglisse@redhat.com> writes:
> 
> > This patch add a new memory migration helpers, which migrate memory
> > backing a range of virtual address of a process to different memory
> > (which can be allocated through special allocator). It differs from
> > numa migration by working on a range of virtual address and thus by
> > doing migration in chunk that can be large enough to use DMA engine
> > or special copy offloading engine.
> >
> > Expected users are any one with heterogeneous memory where different
> > memory have different characteristics (latency, bandwidth, ...). As
> > an example IBM platform with CAPI bus can make use of this feature
> > to migrate between regular memory and CAPI device memory. New CPU
> > architecture with a pool of high performance memory not manage as
> > cache but presented as regular memory (while being faster and with
> > lower latency than DDR) will also be prime user of this patch.
> >
> > Migration to private device memory will be usefull for device that
> > have large pool of such like GPU, NVidia plans to use HMM for that.
> >
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> > Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> > Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> > ---
> >  include/linux/hmm.h |  54 ++++-
> >  mm/migrate.c        | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 635 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index c79abfc..9777309 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -101,10 +101,13 @@ struct hmm;
> >   * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
> >   * HMM_PFN_FAULT: use by hmm_vma_fault() to signify which address need faulting
> >   * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
> > + * HMM_PFN_LOCKED: underlying struct page is lock
> >   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
> >   *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
> >   *      device (the entry will never have HMM_PFN_VALID set and the pfn value
> >   *      is undefine)
> > + * HMM_PFN_MIGRATE: use by hmm_vma_migrate() to signify which address can be
> > + *      migrated
> >   * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
> >   */
> >  typedef unsigned long hmm_pfn_t;
> > @@ -116,9 +119,11 @@ typedef unsigned long hmm_pfn_t;
> >  #define HMM_PFN_EMPTY (1 << 4)
> >  #define HMM_PFN_FAULT (1 << 5)
> >  #define HMM_PFN_DEVICE (1 << 6)
> > -#define HMM_PFN_SPECIAL (1 << 7)
> > -#define HMM_PFN_UNADDRESSABLE (1 << 8)
> > -#define HMM_PFN_SHIFT 9
> > +#define HMM_PFN_LOCKED (1 << 7)
> > +#define HMM_PFN_SPECIAL (1 << 8)
> > +#define HMM_PFN_MIGRATE (1 << 9)
> > +#define HMM_PFN_UNADDRESSABLE (1 << 10)
> > +#define HMM_PFN_SHIFT 11
> >  
> >  static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
> >  {
> > @@ -323,6 +328,49 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
> >  		   hmm_pfn_t *pfns);
> >  
> >  
> > +/*
> > + * struct hmm_migrate_ops - migrate operation callback
> > + *
> > + * @alloc_and_copy: alloc destination memoiry and copy source to it
> > + * @finalize_and_map: allow caller to inspect successfull migrated page
> > + *
> > + * The new HMM migrate helper hmm_vma_migrate() allow memory migration to use
> > + * device DMA engine to perform copy from source to destination memory it also
> > + * allow caller to use its own memory allocator for destination memory.
> > + *
> > + * Note that in alloc_and_copy device driver can decide not to migrate some of
> > + * the entry, for those it must clear the HMM_PFN_MIGRATE flag. The destination
> > + * page must lock and the corresponding hmm_pfn_t value in the array updated
> > + * with the HMM_PFN_MIGRATE and HMM_PFN_LOCKED flag set (and of course be a
> > + * valid entry). It is expected that the page allocated will have an elevated
> > + * refcount and that a put_page() will free the page. Device driver might want
> > + * to allocate with an extra-refcount if they want to control deallocation of
> > + * failed migration inside the finalize_and_map() callback.
> > + *
> > + * Inside finalize_and_map() device driver must use the HMM_PFN_MIGRATE flag to
> > + * determine which page have been successfully migrated.
> > + */
> > +struct hmm_migrate_ops {
> > +	void (*alloc_and_copy)(struct vm_area_struct *vma,
> > +			       unsigned long start,
> > +			       unsigned long end,
> > +			       hmm_pfn_t *pfns,
> > +			       void *private);
> > +	void (*finalize_and_map)(struct vm_area_struct *vma,
> > +				 unsigned long start,
> > +				 unsigned long end,
> > +				 hmm_pfn_t *pfns,
> > +				 void *private);
> > +};
> > +
> > +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> > +		    struct vm_area_struct *vma,
> > +		    unsigned long start,
> > +		    unsigned long end,
> > +		    hmm_pfn_t *pfns,
> > +		    void *private);
> > +
> > +
> >  /* Below are for HMM internal use only ! Not to be use by device driver ! */
> >  void hmm_mm_destroy(struct mm_struct *mm);
> >  
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index d9ce8db..393d592 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -41,6 +41,7 @@
> >  #include <linux/page_idle.h>
> >  #include <linux/page_owner.h>
> >  #include <linux/memremap.h>
> > +#include <linux/hmm.h>
> >  
> >  #include <asm/tlbflush.h>
> >  
> > @@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
> >  	int expected_count = 1 + extra_count;
> >  	void **pslot;
> >  
> > +	/*
> > +	 * ZONE_DEVICE pages have 1 refcount always held by their device
> > +	 *
> > +	 * Note that DAX memory will never reach that point as it does not have
> > +	 * the MEMORY_MOVABLE flag set (see include/linux/memory_hotplug.h).
> > +	 */
> > +	expected_count += is_zone_device_page(page);
> > +
> >  	if (!mapping) {
> >  		/* Anonymous page without mapping */
> >  		if (page_count(page) != expected_count)
> > @@ -2087,3 +2096,578 @@ out_unlock:
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> >  #endif /* CONFIG_NUMA */
> > +
> > +
> > +#if defined(CONFIG_HMM)
> > +struct hmm_migrate {
> > +	struct vm_area_struct	*vma;
> > +	unsigned long		start;
> > +	unsigned long		end;
> > +	unsigned long		npages;
> > +	hmm_pfn_t		*pfns;
> > +};
> > +
> > +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> > +				unsigned long start,
> > +				unsigned long end,
> > +				struct mm_walk *walk)
> > +{
> > +	struct hmm_migrate *migrate = walk->private;
> > +	struct mm_struct *mm = walk->vma->vm_mm;
> > +	unsigned long addr = start;
> > +	spinlock_t *ptl;
> > +	hmm_pfn_t *pfns;
> > +	int pages = 0;
> > +	pte_t *ptep;
> > +
> > +again:
> > +	if (pmd_none(*pmdp))
> > +		return 0;
> > +
> > +	split_huge_pmd(walk->vma, pmdp, addr);
> > +	if (pmd_trans_unstable(pmdp))
> > +		goto again;
> > +
> > +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> > +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +
> > +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> > +		unsigned long pfn;
> > +		swp_entry_t entry;
> > +		struct page *page;
> > +		hmm_pfn_t flags;
> > +		bool write;
> > +		pte_t pte;
> > +
> > +		pte = ptep_get_and_clear(mm, addr, ptep);
> > +		if (!pte_present(pte)) {
> > +			if (pte_none(pte))
> > +				continue;
> > +
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_device_entry(entry)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> > +				continue;
> > +			}
> > +
> > +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
> > +			page = device_entry_to_page(entry);
> > +			write = is_write_device_entry(entry);
> > +			pfn = page_to_pfn(page);
> > +
> > +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> > +				continue;
> > +			}
> > +
> > +		} else {
> > +			pfn = pte_pfn(pte);
> > +			page = pfn_to_page(pfn);
> > +			write = pte_write(pte);
> > +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
> 
> Will that is_zone_device_page() be ever true ?, The pte is present in
> the else patch can the struct page backing that come from zone device ?

Yes. For ZONE_DEVICE on an architecture with a CAPI-like bus you can have ZONE_DEVICE
memory mapped with a present pte, because the device memory is CPU-accessible.

> 
> 
> > +		}
> > +
> > +		/* FIXME support THP see hmm_migrate_page_check() */
> > +		if (PageTransCompound(page))
> > +			continue;
> 
> What about page cache pages. Do we support that ? If not may be skip
> that here ?

No, page cache is supported; it will fail later in the process if it tries to
migrate to un-addressable memory, which needs special handling from the
address_space point of view (handling read/write and writeback).


> > +
> > +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> > +		*pfns |= write ? HMM_PFN_WRITE : 0;
> > +		migrate->npages++;
> > +		get_page(page);
> > +
> > +		if (!trylock_page(page)) {
> > +			set_pte_at(mm, addr, ptep, pte);
> > +		} else {
> > +			pte_t swp_pte;
> > +
> > +			*pfns |= HMM_PFN_LOCKED;
> > +
> > +			entry = make_migration_entry(page, write);
> > +			swp_pte = swp_entry_to_pte(entry);
> > +			if (pte_soft_dirty(pte))
> > +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +			set_pte_at(mm, addr, ptep, swp_pte);
> > +
> > +			page_remove_rmap(page, false);
> > +			put_page(page);
> > +			pages++;
> > +		}
> 
> If this is an optimization, can we get that as a seperate patch with
> addtional comments. ? How does take a successful page lock implies it is
> not a shared mapping ?

It can be a shared mapping and that's fine; migration only fails if the page is
pinned.
 

> > +	}
> > +
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(ptep - 1, ptl);
> > +
> > +	/* Only flush the TLB if we actually modified any entries */
> > +	if (pages)
> > +		flush_tlb_range(walk->vma, start, end);
> > +
> > +	return 0;
> > +}
> 
> 
> So without the optimization the above function is suppose to raise the
> refcount and collect all possible pfns tha we can migrate in the array ?

Yes, correct, this function collects all the pages we can migrate in the range.
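
To make that concrete, each entry in the pfns array ends up encoding something
like the following after the collect pass (a rough sketch only, relying on the
flag layout from the header above, not code from the patch):

	hmm_pfn_t entry = migrate->pfns[i];
	struct page *page = hmm_pfn_to_page(entry);	/* NULL unless valid */
	bool can_migrate = entry & HMM_PFN_MIGRATE;	/* candidate for migration */
	bool writable = entry & HMM_PFN_WRITE;		/* CPU pte was writable */
	bool locked = entry & HMM_PFN_LOCKED;		/* trylock_page() succeeded */
	unsigned long pfn = entry >> HMM_PFN_SHIFT;	/* raw page frame number */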
 

> > +
> > +static void hmm_migrate_collect(struct hmm_migrate *migrate)
> > +{
> > +	struct mm_walk mm_walk;
> > +
> > +	mm_walk.pmd_entry = hmm_collect_walk_pmd;
> > +	mm_walk.pte_entry = NULL;
> > +	mm_walk.pte_hole = NULL;
> > +	mm_walk.hugetlb_entry = NULL;
> > +	mm_walk.test_walk = NULL;
> > +	mm_walk.vma = migrate->vma;
> > +	mm_walk.mm = migrate->vma->vm_mm;
> > +	mm_walk.private = migrate;
> > +
> > +	mmu_notifier_invalidate_range_start(mm_walk.mm,
> > +					    migrate->start,
> > +					    migrate->end);
> > +	walk_page_range(migrate->start, migrate->end, &mm_walk);
> > +	mmu_notifier_invalidate_range_end(mm_walk.mm,
> > +					  migrate->start,
> > +					  migrate->end);
> > +}
> > +
> > +static inline bool hmm_migrate_page_check(struct page *page, int extra)
> > +{
> > +	/*
> > +	 * FIXME support THP (transparent huge page), it is bit more complex to
> > +	 * check them then regular page because they can be map with a pmd or
> > +	 * with a pte (split pte mapping).
> > +	 */
> > +	if (PageCompound(page))
> > +		return false;
> > +
> > +	if (is_zone_device_page(page))
> > +		extra++;
> > +
> > +	if ((page_count(page) - extra) > page_mapcount(page))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +	struct vm_area_struct *vma = migrate->vma;
> > +	unsigned long restore = 0;
> > +	bool allow_drain = true;
> > +
> > +	lru_add_drain();
> > +
> > +again:
> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
> > +			lock_page(page);
> > +			migrate->pfns[i] |= HMM_PFN_LOCKED;
> > +		}
> 
> What does taking a page_lock protect against ? Can we document that ?

This is the usual page migration process, like the existing code: the page lock protects
against anyone trying to map the page inside another process or at a different address.
It also blocks a few fs operations. I don't think there is a comprehensive list anywhere
but I can try to make one.
 
> > +
> > +		/* ZONE_DEVICE page are not on LRU */
> > +		if (is_zone_device_page(page))
> > +			goto check;
> > +
> > +		if (!PageLRU(page) && allow_drain) {
> > +			/* Drain CPU's pagevec so page can be isolated */
> > +			lru_add_drain_all();
> > +			allow_drain = false;
> > +			goto again;
> > +		}
> > +
> > +		if (isolate_lru_page(page)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			put_page(page);
> > +			restore++;
> > +		} else
> > +			/* Drop the reference we took in collect */
> > +			put_page(page);
> > +
> > +check:
> > +		if (!hmm_migrate_page_check(page, 1)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			restore++;
> > +		}
> > +	}
> > +
> > +	if (!restore)
> > +		return;
> > +
> > +	for (addr = migrate->start, i = 0; addr < migrate->end;) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +		unsigned long next, restart;
> > +		spinlock_t *ptl;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE)) {
> > +			addr += PAGE_SIZE;
> > +			i++;
> > +			continue;
> > +		}
> > +
> > +		restart = addr;
> > +		pgdp = pgd_offset(mm, addr);
> > +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pudp = pud_offset(pgdp, addr);
> > +		if (!pudp || pud_none(*pudp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> > +			addr = next;
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			swp_entry_t entry;
> > +			bool write;
> > +			pte_t pte;
> > +
> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +
> > +			write = migrate->pfns[i] & HMM_PFN_WRITE;
> > +			write &= (vma->vm_flags & VM_WRITE);
> > +
> > +			/* Here it means pte must be a valid migration entry */
> > +			pte = ptep_get_and_clear(mm, addr, ptep);
> 
> 
> It is already a non present pte, so why use ptep_get_and_clear ? why not
> *ptep ? Some archs does lot of additional stuff in get_and_clear. 

Yes, I can switch to *ptep. If memory serves, this is most likely because it was
cut and pasted from the collect function, so there is no real motivation behind the
use of get_and_clear.

> > +			if (pte_none(pte) || pte_present(pte))
> > +				/* SOMETHING BAD IS GOING ON ! */
> > +				continue;
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_migration_entry(entry))
> > +				/* SOMETHING BAD IS GOING ON ! */
> > +				continue;
> > +
> > +			if (is_zone_device_page(page) &&
> > +			    !is_addressable_page(page)) {
> > +				entry = make_device_entry(page, write);
> > +				pte = swp_entry_to_pte(entry);
> > +			} else {
> > +				pte = mk_pte(page, vma->vm_page_prot);
> > +				pte = pte_mkold(pte);
> > +				if (write)
> > +					pte = pte_mkwrite(pte);
> > +			}
> > +			if (pte_swp_soft_dirty(*ptep))
> > +				pte = pte_mksoft_dirty(pte);
> > +
> > +			get_page(page);
> > +			set_pte_at(mm, addr, ptep, pte);
> > +			if (PageAnon(page))
> > +				page_add_anon_rmap(page, vma, addr, false);
> > +			else
> > +				page_add_file_rmap(page, false);
> 
> Do we support pagecache already ? Or is thise just a place holder ? if
> so may be we can drop it and add it later when we add page cache
> support. ?

It supports page cache already. It does not support page cache migration to
un-addressable memory. So in the CAPI case page cache works, but on x86/PCIE it
does not.

> 
> > +		}
> > +		pte_unmap_unlock(ptep - 1, ptl);
> > +
> > +		addr = restart;
> > +		i = (addr - migrate->start) >> PAGE_SHIFT;
> > +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +
> > +			migrate->pfns[i] = 0;
> > +			unlock_page(page);
> > +			restore--;
> > +
> > +			if (is_zone_device_page(page)) {
> > +				put_page(page);
> > +				continue;
> > +			}
> > +
> > +			putback_lru_page(page);
> > +		}
> > +
> > +		if (!restore)
> > +			break;
> > +	}
> 
> 
> All the above restore won't be needed if we didn't do that migration
> entry setup in the first function right ? We just need to drop the
> refcount for pages that we failed to isolated ? No need to walk the page
> table etc ?

Well, the migration entry setup is important so that no concurrent migrations
can race with each other; the one that sets the migration entry first is the
one that wins with respect to migration. Also, the CPU page table entry needs to
be cleared so that the page content is stable and the DMA copy does not miss any
data left over in some cache.

> 
> > +}
> > +
> > +static void hmm_migrate_unmap(struct hmm_migrate *migrate)
> > +{
> > +	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> > +	unsigned long addr = migrate->start, i = 0, restore = 0;
> > +
> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +			continue;
> > +
> > +		try_to_unmap(page, flags);
> > +		if (page_mapped(page) || !hmm_migrate_page_check(page, 1)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			restore++;
> > +		}
> > +	}
> > +
> > +	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +			continue;
> > +
> > +		remove_migration_ptes(page, page, false);
> > +
> > +		migrate->pfns[i] = 0;
> > +		unlock_page(page);
> > +		restore--;
> > +
> > +		if (is_zone_device_page(page)) {
> > +			put_page(page);
> > +			continue;
> > +		}
> > +
> > +		putback_lru_page(page);
> 
> May be 
> 
>      } else 
> 	putback_lru_page(page);

Yes, it is probably clearer to use an else branch there.
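
Ie something along these lines (just a sketch of the restructure, behavior
unchanged):

	if (is_zone_device_page(page))
		put_page(page);
	else
		putback_lru_page(page);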

> > +	}
> > +}
> > +
> > +static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +
> > +	for (; addr < migrate->end;) {
> > +		unsigned long next;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		pgdp = pgd_offset(mm, addr);
> > +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pudp = pud_offset(pgdp, addr);
> > +		if (!pudp || pud_none(*pudp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> > +			addr = next;
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +
> > +		/* No need to lock nothing can change from under us */
> > +		ptep = pte_offset_map(pmdp, addr);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			struct address_space *mapping;
> > +			struct page *newpage, *page;
> > +			swp_entry_t entry;
> > +			int r;
> > +
> > +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!newpage || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +			if (pte_none(*ptep) || pte_present(*ptep)) {
> > +				/* This should not happen but be nice */
> > +				migrate->pfns[i] = 0;
> > +				put_page(newpage);
> > +				continue;
> > +			}
> > +			entry = pte_to_swp_entry(*ptep);
> > +			if (!is_migration_entry(entry)) {
> > +				/* This should not happen but be nice */
> > +				migrate->pfns[i] = 0;
> > +				put_page(newpage);
> > +				continue;
> > +			}
> > +
> > +			page = migration_entry_to_page(entry);
> > +			mapping = page_mapping(page);
> > +
> > +			/*
> > +			 * For now only support private anonymous when migrating
> > +			 * to un-addressable device memory.
> > +			 */
> > +			if (mapping && is_zone_device_page(newpage) &&
> > +			    !is_addressable_page(newpage)) {
> > +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +				continue;
> > +			}
> > +
> > +			r = migrate_page(mapping, newpage, page,
> > +					 MIGRATE_SYNC, false);
> > +			if (r != MIGRATEPAGE_SUCCESS)
> > +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +		}
> > +		pte_unmap(ptep - 1);
> > +	}
> > +}
> 
> 
> Why are we walking the page table multiple times ? Is it that after
> alloc_copy the content of migrate->pfns pfn array is now the new pfns ?
> It is confusing that each of these functions walk one page table
> multiple times (even when page can be shared). I was expecting us to
> walk the page table once to collect the pfns/pages and then use that
> in rest of the calls. Any specific reason you choose to implement it
> this way ?

Well, you need to know both the source and the destination page, so either I have
two arrays, one for source pages and one for destination pages, and then I do
not need to walk the page table multiple times. But needing two arrays might be
problematic, as here we want to migrate a reasonable chunk, ie a few megabytes,
hence there is a need for vmalloc.

My advice to device drivers was to pre-allocate this array once (maybe
preallocate a couple of them). If you really prefer avoiding walking the
CPU page table over and over then I can switch to the two-array solution.
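
For reference, here is a minimal sketch of how a driver would use the helper
with a single preallocated array (the dummy_* names are made up for the example,
this is not code from the patchset):

/* The array first holds source entries; alloc_and_copy() overwrites the
 * entries it migrates with destination entries (valid, locked, MIGRATE). */
static void dummy_alloc_and_copy(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 hmm_pfn_t *pfns, void *private)
{
	unsigned long addr, i;

	for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
		struct page *dst, *src = hmm_pfn_to_page(pfns[i]);

		if (!src || !(pfns[i] & HMM_PFN_MIGRATE))
			continue;

		dst = dummy_device_alloc_page(private);	/* made-up helper */
		if (!dst) {
			/* Tell HMM not to migrate this address after all. */
			pfns[i] &= ~HMM_PFN_MIGRATE;
			continue;
		}
		lock_page(dst);
		dummy_device_dma_copy(private, dst, src);	/* made-up helper */

		/* Replace the source entry with the destination entry. */
		pfns[i] = hmm_pfn_from_pfn(page_to_pfn(dst)) |
			  HMM_PFN_DEVICE | HMM_PFN_MIGRATE | HMM_PFN_LOCKED |
			  (pfns[i] & HMM_PFN_WRITE);
	}
}

static void dummy_finalize_and_map(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end,
				   hmm_pfn_t *pfns, void *private)
{
	/* Entries still carrying HMM_PFN_MIGRATE were migrated successfully;
	 * update the device page table here if needed. */
}

static const struct hmm_migrate_ops dummy_migrate_ops = {
	.alloc_and_copy		= dummy_alloc_and_copy,
	.finalize_and_map	= dummy_finalize_and_map,
};

The driver then calls hmm_vma_migrate(&dummy_migrate_ops, vma, start, end, pfns,
private) with a pfns array it preallocated once and reuses for every migration.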
 

> > +
> > +static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +
> > +	for (; addr < migrate->end;) {
> > +		unsigned long next;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		pgdp = pgd_offset(mm, addr);
> > +		pudp = pud_offset(pgdp, addr);
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +
> > +		/* No need to lock nothing can change from under us */
> > +		ptep = pte_offset_map(pmdp, addr);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			struct page *page, *newpage;
> > +			swp_entry_t entry;
> > +
> > +			if (pte_none(*ptep) || pte_present(*ptep))
> > +				continue;
> > +			entry = pte_to_swp_entry(*ptep);
> > +			if (!is_migration_entry(entry))
> > +				continue;
> > +
> > +			page = migration_entry_to_page(entry);
> > +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!newpage)
> > +				newpage = page;
> > +			remove_migration_ptes(page, newpage, false);
> > +
> > +			migrate->pfns[i] = 0;
> > +			unlock_page(page);
> > +			migrate->npages--;
> > +
> > +			if (is_zone_device_page(page))
> > +				put_page(page);
> > +			else
> > +				putback_lru_page(page);
> > +
> > +			if (newpage != page) {
> > +				unlock_page(newpage);
> > +				if (is_zone_device_page(newpage))
> > +					put_page(newpage);
> > +				else
> > +					putback_lru_page(newpage);
> > +			}
> > +		}
> > +		pte_unmap(ptep - 1);
> > +	}
> > +}
> > +
> > +/*
> > + * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
> > + *
> > + * @ops: migration callback for allocating destination memory and copying
> > + * @vma: virtual memory area containing the range to be migrated
> > + * @start: start address of the range to migrate (inclusive)
> > + * @end: end address of the range to migrate (exclusive)
> > + * @pfns: array of hmm_pfn_t first containing source pfns then destination
> > + * @private: pointer passed back to each of the callback
> > + * Returns: 0 on success, error code otherwise
> > + *
> > + * This will try to migrate a range of memory using callback to allocate and
> > + * copy memory from source to destination. This function will first collect,
> > + * lock and unmap pages in the range and then call alloc_and_copy() callback
> > + * for device driver to allocate destination memory and copy from source.
> > + *
> > + * Then it will proceed and try to effectively migrate the page (struct page
> > + * metadata) a step that can fail for various reasons. Before updating CPU page
> > + * table it will call finalize_and_map() callback so that device driver can
> > + * inspect what have been successfully migrated and update its own page table
> > + * (this latter aspect is not mandatory and only make sense for some user of
> > + * this API).
> > + *
> > + * Finaly the function update CPU page table and unlock the pages before
> > + * returning 0.
> > + *
> > + * It will return an error code only if one of the argument is invalid.
> > + */
> > +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> > +		    struct vm_area_struct *vma,
> > +		    unsigned long start,
> > +		    unsigned long end,
> > +		    hmm_pfn_t *pfns,
> > +		    void *private)
> > +{
> > +	struct hmm_migrate migrate;
> > +
> > +	/* Sanity check the arguments */
> > +	start &= PAGE_MASK;
> > +	end &= PAGE_MASK;
> > +	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> > +		return -EINVAL;
> > +	if (!vma || !ops || !pfns || start >= end)
> > +		return -EINVAL;
> > +	if (start < vma->vm_start || start >= vma->vm_end)
> > +		return -EINVAL;
> > +	if (end <= vma->vm_start || end > vma->vm_end)
> > +		return -EINVAL;
> > +
> > +	migrate.start = start;
> > +	migrate.pfns = pfns;
> > +	migrate.npages = 0;
> > +	migrate.end = end;
> > +	migrate.vma = vma;
> > +
> > +	/* Collect, and try to unmap source pages */
> > +	hmm_migrate_collect(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/* Lock and isolate page */
> > +	hmm_migrate_lock_and_isolate(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/* Unmap pages */
> > +	hmm_migrate_unmap(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/*
> > +	 * At this point pages are lock and unmap and thus they have stable
> > +	 * content and can safely be copied to destination memory that is
> > +	 * allocated by the callback.
> > +	 *
> > +	 * Note that migration can fail in hmm_migrate_struct_page() for each
> > +	 * individual page.
> > +	 */
> > +	ops->alloc_and_copy(vma, start, end, pfns, private);
> > +
> > +	/* This does the real migration of struct page */
> > +	hmm_migrate_struct_page(&migrate);
> > +
> > +	ops->finalize_and_map(vma, start, end, pfns, private);
> > +
> > +	/* Unlock and remap pages */
> > +	hmm_migrate_remove_migration_pte(&migrate);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(hmm_vma_migrate);
> > +#endif /* CONFIG_HMM */
> 
> IMHO If we can get each of the above functions documented properly it will
> help with code review. Also if we can avoid that multiple page table
> walk, it will make it closer to the existing migration logic.
> 

What kind of documentation are you looking for? I thought the high-level overview
was enough, as none of the functions do anything out of the ordinary. Do you want
more inline documentation? Or a more verbose high-level overview?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-19 17:17     ` Jerome Glisse
@ 2016-11-20 18:21       ` Aneesh Kumar K.V
  2016-11-20 20:06         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Aneesh Kumar K.V @ 2016-11-20 18:21 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

Jerome Glisse <jglisse@redhat.com> writes:

.....

>> > +
>> > +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
>> > +		*pfns |= write ? HMM_PFN_WRITE : 0;
>> > +		migrate->npages++;
>> > +		get_page(page);
>> > +
>> > +		if (!trylock_page(page)) {
>> > +			set_pte_at(mm, addr, ptep, pte);
>> > +		} else {
>> > +			pte_t swp_pte;
>> > +
>> > +			*pfns |= HMM_PFN_LOCKED;
>> > +
>> > +			entry = make_migration_entry(page, write);
>> > +			swp_pte = swp_entry_to_pte(entry);
>> > +			if (pte_soft_dirty(pte))
>> > +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>> > +			set_pte_at(mm, addr, ptep, swp_pte);
>> > +
>> > +			page_remove_rmap(page, false);
>> > +			put_page(page);
>> > +			pages++;
>> > +		}
>> 
>> If this is an optimization, can we get that as a seperate patch with
>> addtional comments. ? How does take a successful page lock implies it is
>> not a shared mapping ?
>
> It can be a shared mapping and that's fine; migration only fails if the page is
> pinned.
>

In the previous mail you replied that the above trylock_page() usage is an
optimization for the usual case where the memory is only in use by one
process and no concurrent migration/memory event is happening.

How did we know that it is only in use by one process? I got the part
that if we can lock, and since we lock the page early, it avoids
concurrent migration. But I am not sure about the use-by-one-process
part.


>
>> > +	}
>> > +
>> > +	arch_leave_lazy_mmu_mode();
>> > +	pte_unmap_unlock(ptep - 1, ptl);
>> > +
>> > +	/* Only flush the TLB if we actually modified any entries */
>> > +	if (pages)
>> > +		flush_tlb_range(walk->vma, start, end);
>> > +
>> > +	return 0;
>> > +}
>> 
>> 
>> So without the optimization the above function is suppose to raise the
>> refcount and collect all possible pfns tha we can migrate in the array ?
>
> Yes, correct, this function collects all the pages we can migrate in the range.
>

.....

>
>> > +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
>> > +{
>> > +	unsigned long addr = migrate->start, i = 0;
>> > +	struct mm_struct *mm = migrate->vma->vm_mm;
>> > +	struct vm_area_struct *vma = migrate->vma;
>> > +	unsigned long restore = 0;
>> > +	bool allow_drain = true;
>> > +
>> > +	lru_add_drain();
>> > +
>> > +again:
>> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
>> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
>> > +
>> > +		if (!page)
>> > +			continue;
>> > +
>> > +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
>> > +			lock_page(page);
>> > +			migrate->pfns[i] |= HMM_PFN_LOCKED;
>> > +		}
>> 
>> What does taking a page_lock protect against ? Can we document that ?
>
> This is the usual page migration process, like the existing code: the page lock protects
> against anyone trying to map the page inside another process or at a different address.
> It also blocks a few fs operations. I don't think there is a comprehensive list anywhere
> but I can try to make one.


I was comparing it against the trylock_page() usage above. But I guess
documenting the page lock can be another patch. 


>
>> > +
>> > +		/* ZONE_DEVICE page are not on LRU */
>> > +		if (is_zone_device_page(page))
>> > +			goto check;
>> > +
>> > +		if (!PageLRU(page) && allow_drain) {
>> > +			/* Drain CPU's pagevec so page can be isolated */
>> > +			lru_add_drain_all();
>> > +			allow_drain = false;
>> > +			goto again;
>> > +		}
>> > +
>> > +		if (isolate_lru_page(page)) {
>> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
>> > +			migrate->npages--;
>> > +			put_page(page);
>> > +			restore++;
>> > +		} else
>> > +			/* Drop the reference we took in collect */
>> > +			put_page(page);
>> > +
>> > +check:
>> > +		if (!hmm_migrate_page_check(page, 1)) {
>> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
>> > +			migrate->npages--;
>> > +			restore++;
>> > +		}
>> > +	}
>> > +
> 

.....

>> > +		}
>> > +		pte_unmap_unlock(ptep - 1, ptl);
>> > +
>> > +		addr = restart;
>> > +		i = (addr - migrate->start) >> PAGE_SHIFT;
>> > +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
>> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
>> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
>> > +				continue;
>> > +
>> > +			migrate->pfns[i] = 0;
>> > +			unlock_page(page);
>> > +			restore--;
>> > +
>> > +			if (is_zone_device_page(page)) {
>> > +				put_page(page);
>> > +				continue;
>> > +			}
>> > +
>> > +			putback_lru_page(page);
>> > +		}
>> > +
>> > +		if (!restore)
>> > +			break;
>> > +	}
>> 
>> 
>> All the above restore won't be needed if we didn't do that migration
>> entry setup in the first function right ? We just need to drop the
>> refcount for pages that we failed to isolated ? No need to walk the page
>> table etc ?
>
> Well, the migration entry setup is important so that no concurrent migrations
> can race with each other; the one that sets the migration entry first is the
> one that wins with respect to migration. Also, the CPU page table entry needs to
> be cleared so that the page content is stable and the DMA copy does not miss any
> data left over in some cache.

This is the part I am still trying to understand.
hmm_collect_walk_pmd() did the migration entry setup only in one process's
page table. So how can it prevent concurrent migration, given that one could
initiate a migration using the va/mapping of another process?

Isn't it the page lock that prevents concurrent migration?

........

> 
>> Why are we walking the page table multiple times ? Is it that after
>> alloc_copy the content of migrate->pfns pfn array is now the new pfns ?
>> It is confusing that each of these functions walk one page table
>> multiple times (even when page can be shared). I was expecting us to
>> walk the page table once to collect the pfns/pages and then use that
>> in rest of the calls. Any specific reason you choose to implement it
>> this way ?
>
> Well, you need to know both the source and the destination page, so either I have
> two arrays, one for source pages and one for destination pages, and then I do
> not need to walk the page table multiple times. But needing two arrays might be
> problematic, as here we want to migrate a reasonable chunk, ie a few megabytes,
> hence there is a need for vmalloc.
>
> My advice to device drivers was to pre-allocate this array once (maybe
> preallocate a couple of them). If you really prefer avoiding walking the
> CPU page table over and over then I can switch to the two-array solution.
>

Having two arrays makes it easier to follow the code. But otherwise I guess
documenting the above usage of the page table above the function will also
help.

.....

>> IMHO If we can get each of the above functions documented properly it will
>> help with code review. Also if we can avoid that multiple page table
>> walk, it will make it closer to the existing migration logic.
>> 
>
> What kind of documentation are you looking for? I thought the high-level overview
> was enough, as none of the functions do anything out of the ordinary. Do you want
> more inline documentation? Or a more verbose high-level overview?


Inline documentation for the functions will be useful. Also, if you can split
the hmm_collect_walk_pmd() optimization we discussed above into a
separate patch, I guess this will be a lot easier to follow.

I still haven't understood why we set up that migration entry early, and
that too only in one process's page table. If we can explain that as a
separate patch, maybe it will be much easier to follow.

-aneesh

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-20 18:21       ` Aneesh Kumar K.V
@ 2016-11-20 20:06         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-20 20:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Mark Hairgrove,
	Sherry Cheung, Subhash Gutti

On Sun, Nov 20, 2016 at 11:51:48PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <jglisse@redhat.com> writes:
> 
> .....
> 
> >> > +
> >> > +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> >> > +		*pfns |= write ? HMM_PFN_WRITE : 0;
> >> > +		migrate->npages++;
> >> > +		get_page(page);
> >> > +
> >> > +		if (!trylock_page(page)) {
> >> > +			set_pte_at(mm, addr, ptep, pte);
> >> > +		} else {
> >> > +			pte_t swp_pte;
> >> > +
> >> > +			*pfns |= HMM_PFN_LOCKED;
> >> > +
> >> > +			entry = make_migration_entry(page, write);
> >> > +			swp_pte = swp_entry_to_pte(entry);
> >> > +			if (pte_soft_dirty(pte))
> >> > +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> >> > +			set_pte_at(mm, addr, ptep, swp_pte);
> >> > +
> >> > +			page_remove_rmap(page, false);
> >> > +			put_page(page);
> >> > +			pages++;
> >> > +		}
> >> 
> >> If this is an optimization, can we get that as a seperate patch with
> >> addtional comments. ? How does take a successful page lock implies it is
> >> not a shared mapping ?
> >
> > It can be a shared mapping and that's fine; migration only fails if the page is
> > pinned.
> >
> 
> In the previous mail you replied that the above trylock_page() usage is an
> optimization for the usual case where the memory is only in use by one
> process and no concurrent migration/memory event is happening.
>
> How did we know that it is only in use by one process? I got the part
> that if we can lock, and since we lock the page early, it avoids
> concurrent migration. But I am not sure about the use-by-one-process
> part.
> 

The mapcount will reflect that, and it is handled later inside the unmap
function. The refcount is checked for pins too.

> 
> >
> >> > +	}
> >> > +
> >> > +	arch_leave_lazy_mmu_mode();
> >> > +	pte_unmap_unlock(ptep - 1, ptl);
> >> > +
> >> > +	/* Only flush the TLB if we actually modified any entries */
> >> > +	if (pages)
> >> > +		flush_tlb_range(walk->vma, start, end);
> >> > +
> >> > +	return 0;
> >> > +}
> >> 
> >> 
> >> So without the optimization the above function is suppose to raise the
> >> refcount and collect all possible pfns tha we can migrate in the array ?
> >
> > Yes, correct, this function collects all the pages we can migrate in the range.
> >
> 
> .....
> 
> >
> >> > +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
> >> > +{
> >> > +	unsigned long addr = migrate->start, i = 0;
> >> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> >> > +	struct vm_area_struct *vma = migrate->vma;
> >> > +	unsigned long restore = 0;
> >> > +	bool allow_drain = true;
> >> > +
> >> > +	lru_add_drain();
> >> > +
> >> > +again:
> >> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> >> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> >> > +
> >> > +		if (!page)
> >> > +			continue;
> >> > +
> >> > +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
> >> > +			lock_page(page);
> >> > +			migrate->pfns[i] |= HMM_PFN_LOCKED;
> >> > +		}
> >> 
> >> What does taking a page_lock protect against ? Can we document that ?
> >
> > This is the usual page migration process, like the existing code: the page lock protects
> > against anyone trying to map the page inside another process or at a different address.
> > It also blocks a few fs operations. I don't think there is a comprehensive list anywhere
> > but I can try to make one.
> 
> 
> I was comparing it against the trylock_page() usage above. But I guess
> documenting the page lock can be another patch. 

Well, trylock_page() in the collect function happens under a spinlock (the page table
spinlock), hence we can't sleep and don't want to spin either.


> 
> >
> >> > +
> >> > +		/* ZONE_DEVICE page are not on LRU */
> >> > +		if (is_zone_device_page(page))
> >> > +			goto check;
> >> > +
> >> > +		if (!PageLRU(page) && allow_drain) {
> >> > +			/* Drain CPU's pagevec so page can be isolated */
> >> > +			lru_add_drain_all();
> >> > +			allow_drain = false;
> >> > +			goto again;
> >> > +		}
> >> > +
> >> > +		if (isolate_lru_page(page)) {
> >> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> >> > +			migrate->npages--;
> >> > +			put_page(page);
> >> > +			restore++;
> >> > +		} else
> >> > +			/* Drop the reference we took in collect */
> >> > +			put_page(page);
> >> > +
> >> > +check:
> >> > +		if (!hmm_migrate_page_check(page, 1)) {
> >> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> >> > +			migrate->npages--;
> >> > +			restore++;
> >> > +		}
> >> > +	}
> >> > +
> > 
> 
> .....
> 
> >> > +		}
> >> > +		pte_unmap_unlock(ptep - 1, ptl);
> >> > +
> >> > +		addr = restart;
> >> > +		i = (addr - migrate->start) >> PAGE_SHIFT;
> >> > +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
> >> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
> >> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> >> > +				continue;
> >> > +
> >> > +			migrate->pfns[i] = 0;
> >> > +			unlock_page(page);
> >> > +			restore--;
> >> > +
> >> > +			if (is_zone_device_page(page)) {
> >> > +				put_page(page);
> >> > +				continue;
> >> > +			}
> >> > +
> >> > +			putback_lru_page(page);
> >> > +		}
> >> > +
> >> > +		if (!restore)
> >> > +			break;
> >> > +	}
> >> 
> >> 
> >> All the above restore won't be needed if we didn't do that migration
> >> entry setup in the first function right ? We just need to drop the
> >> refcount for pages that we failed to isolated ? No need to walk the page
> >> table etc ?
> >
> > Well, the migration entry setup is important so that no concurrent migrations
> > can race with each other; the one that sets the migration entry first is the
> > one that wins with respect to migration. Also, the CPU page table entry needs to
> > be cleared so that the page content is stable and the DMA copy does not miss any
> > data left over in some cache.
> 
> This is the part I am still trying to understand.
> hmm_collect_walk_pmd() did the migration entry setup only in one process's
> page table. So how can it prevent concurrent migration, given that one could
> initiate a migration using the va/mapping of another process?
> 

Well, hmm_migrate_unmap() will unmap the page in all processes before we
call alloc_and_copy(). When alloc_and_copy() is called the page is unmapped, ie the
mapcount is zero and there is no pin either. Because the page is locked, no new
mapping to it can be spawned from under us.

This is exactly like the existing migration code; the difference is that the existing
migration code does not do the collect or lock steps. It expects to get the page
locked and then unmapped before trying to migrate.

So ignoring the collect pass and the optimization where I unmap the page in the
current process, my logic for migration is otherwise exactly the same as the
existing one.


> Isn't it the page lock that prevents concurrent migration?

Yes, the page lock does prevent concurrent migration. But for the collect pass of
my code, having the special migration entry is also an important hint that it
is pointless to migrate that page. Moreover, the special migration entry exists
for a reason: it is an indicator, in a couple of places, that is important to
have.

> 
> ........
> 
> > 
> >> Why are we walking the page table multiple times ? Is it that after
> >> alloc_copy the content of migrate->pfns pfn array is now the new pfns ?
> >> It is confusing that each of these functions walk one page table
> >> multiple times (even when page can be shared). I was expecting us to
> >> walk the page table once to collect the pfns/pages and then use that
> >> in rest of the calls. Any specific reason you choose to implement it
> >> this way ?
> >
> > Well, you need to know both the source and the destination page, so either I have
> > two arrays, one for source pages and one for destination pages, and then I do
> > not need to walk the page table multiple times. But needing two arrays might be
> > problematic, as here we want to migrate a reasonable chunk, ie a few megabytes,
> > hence there is a need for vmalloc.
> >
> > My advice to device drivers was to pre-allocate this array once (maybe
> > preallocate a couple of them). If you really prefer avoiding walking the
> > CPU page table over and over then I can switch to the two-array solution.
> >
> 
> Having two arrays makes it easier to follow the code. But otherwise I guess
> documenting the above usage of the page table above the function will also
> help.
> 
> .....
> 
> >> IMHO If we can get each of the above functions documented properly it will
> >> help with code review. Also if we can avoid that multiple page table
> >> walk, it will make it closer to the existing migration logic.
> >> 
> >
> > What kind of documentation are you looking for? I thought the high-level overview
> > was enough, as none of the functions do anything out of the ordinary. Do you want
> > more inline documentation? Or a more verbose high-level overview?
> 
> 
> Inline documentation for the functions will be useful. Also, if you can split
> the hmm_collect_walk_pmd() optimization we discussed above into a
> separate patch, I guess this will be a lot easier to follow.

Ok will do.

> 
> I still haven't understood why we set up that migration entry early, and
> that too only in one process's page table. If we can explain that as a
> separate patch, maybe it will be much easier to follow.

Well, the one-process and early setup is the optimization; I do set up the special
migration entry in all processes inside hmm_migrate_unmap(). So my migration
works exactly like the existing one, except that I optimize the common case where
the page we are interested in is only mapped in the process we are doing the
migration against.

I will split the optimization as its own patch.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
@ 2016-11-21  0:44   ` Balbir Singh
  2016-11-21  4:53     ` Jerome Glisse
  2016-11-21  6:41   ` Anshuman Khandual
  1 sibling, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  0:44 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Russell King, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin



On 19/11/16 05:18, Jérôme Glisse wrote:
> Only usefull for arch where we support ZONE_DEVICE and where we want to
> also support un-addressable device memory. We need struct page for such
> un-addressable memory. But we should avoid populating the kernel linear
> mapping for the physical address range because there is no real memory
> or anything behind those physical address.
> 
> Hence we need more flags than just knowing if it is device memory or not.
> 


Isn't it better to add a wrapper around arch_add/remove_memory, do those
checks inside it, and then call arch_add/remove_memory, to reduce the churn?
If you need to selectively enable MEMORY_UNADDRESSABLE, that can be done with
_ARCH_HAS_FEATURE.
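
Roughly something like this (just a sketch, the wrapper and config names are
invented):

/* Common wrapper doing the flag checks; the per-arch arch_add_memory()
 * keeps its current bool parameter. */
int add_memory_with_flags(int nid, u64 start, u64 size, int flags)
{
	if ((flags & MEMORY_UNADDRESSABLE) &&
	    !IS_ENABLED(CONFIG_ARCH_HAS_UNADDRESSABLE_MEMORY))
		return -EINVAL;

	return arch_add_memory(nid, start, size, flags & MEMORY_DEVICE);
}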

> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Chris Metcalf <cmetcalf@mellanox.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> ---
>  arch/ia64/mm/init.c            | 19 ++++++++++++++++---
>  arch/powerpc/mm/mem.c          | 18 +++++++++++++++---
>  arch/s390/mm/init.c            | 10 ++++++++--
>  arch/sh/mm/init.c              | 18 +++++++++++++++---
>  arch/tile/mm/init.c            | 10 ++++++++--
>  arch/x86/mm/init_32.c          | 19 ++++++++++++++++---
>  arch/x86/mm/init_64.c          | 19 ++++++++++++++++---
>  include/linux/memory_hotplug.h | 17 +++++++++++++++--
>  kernel/memremap.c              |  4 ++--
>  mm/memory_hotplug.c            |  4 ++--
>  10 files changed, 113 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 1841ef6..95a2fa5 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -645,7 +645,7 @@ mem_init (void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	pg_data_t *pgdat;
>  	struct zone *zone;
> @@ -653,10 +653,17 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> +		zone_for_memory(nid, start, size, ZONE_NORMAL,
> +				flags & MEMORY_DEVICE);
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
>  
>  	if (ret)
> @@ -667,13 +674,19 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (ret)
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 5f84433..e3c0532 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -126,7 +126,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
>  	return -ENODEV;
>  }
>  
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata;
>  	struct zone *zone;
> @@ -134,6 +134,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int rc;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdata = NODE_DATA(nid);
>  
>  	start = (unsigned long)__va(start);
> @@ -147,18 +153,24 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  
>  	/* this should work for most non-highmem platforms */
>  	zone = pgdata->node_zones +
> -		zone_for_memory(nid, start, size, 0, for_device);
> +		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
>  
>  	return __add_pages(nid, zone, start_pfn, nr_pages);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
> +	
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
>  
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index f56a39b..4147b87 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	unsigned long normal_end_pfn = PFN_DOWN(memblock_end_of_DRAM());
>  	unsigned long dma_end_pfn = PFN_DOWN(MAX_DMA_ADDRESS);
> @@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages;
>  	int rc, zone_enum;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	rc = vmem_add_mapping(start, size);
>  	if (rc)
>  		return rc;
> @@ -197,7 +203,7 @@ unsigned long memory_block_size_bytes(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	/*
>  	 * There is no hardware or firmware interface which could trigger a
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index 7549186..f72a402 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -485,19 +485,25 @@ void free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	pg_data_t *pgdat;
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	/* We only have ZONE_NORMAL, so this is easy.. */
>  	ret = __add_pages(nid, pgdat->node_zones +
>  			zone_for_memory(nid, start, size, ZONE_NORMAL,
> -			for_device),
> +					flags & MEMORY_DEVICE),
>  			start_pfn, nr_pages);
>  	if (unlikely(ret))
>  		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
> @@ -516,13 +522,19 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (unlikely(ret))
> diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
> index adce254..5fd972c 100644
> --- a/arch/tile/mm/init.c
> +++ b/arch/tile/mm/init.c
> @@ -863,13 +863,19 @@ void __init mem_init(void)
>   * memory to the highmem for now.
>   */
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
> -int arch_add_memory(u64 start, u64 size, bool for_device)
> +int arch_add_memory(u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata = &contig_page_data;
>  	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	return __add_pages(zone, start_pfn, nr_pages);
>  }
>  
> @@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	/* TODO */
>  	return -EBUSY;
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index cf80590..16a9095 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -816,24 +816,37 @@ void __init mem_init(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata = NODE_DATA(nid);
>  	struct zone *zone = pgdata->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
> +		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
> +				flags & MEMORY_DEVICE);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	return __add_pages(nid, zone, start_pfn, nr_pages);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	return __remove_pages(zone, start_pfn, nr_pages);
>  }
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 14b9dd7..8c4abb0 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -651,15 +651,22 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
>   * Memory is added always to NORMAL zone. This means you will never get
>   * additional DMA/DMA32 memory.
>   */
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdat = NODE_DATA(nid);
>  	struct zone *zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> +		zone_for_memory(nid, start, size, ZONE_NORMAL,
> +				flags & MEMORY_DEVICE);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	init_memory_mapping(start, start + size);
>  
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
> @@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>  	remove_pagetable(start, end, true);
>  }
>  
> -int __ref arch_remove_memory(u64 start, u64 size)
> +int __ref arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> @@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	/* With altmap the first mapped page is offset from @start */
>  	altmap = to_vmem_altmap((unsigned long) page);
>  	if (altmap)
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 01033fa..ba9b12e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -103,7 +103,7 @@ extern bool memhp_auto_online;
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern bool is_pageblock_removable_nolock(struct page *page);
> -extern int arch_remove_memory(u64 start, u64 size);
> +extern int arch_remove_memory(u64 start, u64 size, int flags);
>  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
>  	unsigned long nr_pages);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
>  extern int add_memory_resource(int nid, struct resource *resource, bool online);
>  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>  		bool for_device);
> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> +
> +/*
> + * For device memory we want more informations than just knowing it is device
				     information
> + * memory. We want to know if we can migrate it (ie it is not storage memory
> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> + * memory can not be access by CPU but we still want struct page so that we
			accessed
> + * can use it like regular memory.

Can you please add some details on why -- migration needs them for example?

> + */
> +#define MEMORY_FLAGS_NONE 0
> +#define MEMORY_DEVICE (1 << 0)
> +#define MEMORY_MOVABLE (1 << 1)
> +#define MEMORY_UNADDRESSABLE (1 << 2)
> +
> +extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern void remove_memory(int nid, u64 start, u64 size);
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index b501e39..07665eb 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>  	/* pages are dead and unused, undo the arch mapping */
>  	align_start = res->start & ~(SECTION_SIZE - 1);
>  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
> -	arch_remove_memory(align_start, align_size);
> +	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
>  	pgmap_radix_release(res);
>  	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
> @@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	if (error)
>  		goto err_pfn_remap;
>  
> -	error = arch_add_memory(nid, align_start, align_size, true);
> +	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
>  	if (error)
>  		goto err_add_memory;
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9629273..b2942d7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1386,7 +1386,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
>  	}
>  
>  	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, false);
> +	ret = arch_add_memory(nid, start, size, MEMORY_FLAGS_NONE);
>  
>  	if (ret < 0)
>  		goto error;
> @@ -2205,7 +2205,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(start, size);
> +	arch_remove_memory(start, size, MEMORY_FLAGS_NONE);
>  
>  	try_offline_node(nid);
>  
> 

Balbir Singh.


* Re: [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
@ 2016-11-21  1:49   ` Balbir Singh
  2016-11-21  4:57     ` Jerome Glisse
  2016-11-21  8:26   ` Anshuman Khandual
  1 sibling, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  1:49 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler



On 19/11/16 05:18, Jérôme Glisse wrote:
> When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
> is holding a reference on it (only device to which the memory belong do).
> Add a callback and call it when that happen so device driver can implement
> their own free page management.
> 

Could you give an example of what their own free page management might look like?
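
For concreteness, a sketch of what such a callback might look like for a hypothetical driver that tracks its device pages in an allocation bitmap; the free_devpage() signature comes from the dev_pagemap change in this series, everything else here is made up:

struct mydev {
	spinlock_t lock;
	unsigned long base_pfn;		/* first pfn of the device memory */
	unsigned long *alloc_bitmap;	/* one bit per device page, set = in use */
};

static void mydev_free_devpage(struct page *page, void *data)
{
	struct mydev *mydev = data;
	unsigned long idx = page_to_pfn(page) - mydev->base_pfn;

	/* Refcount dropped back to 1: only the device owns the page now. */
	spin_lock(&mydev->lock);
	clear_bit(idx, mydev->alloc_bitmap);	/* page is free for re-use */
	spin_unlock(&mydev->lock);
}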

Balbir Singh.


* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
@ 2016-11-21  2:06   ` Balbir Singh
  2016-11-21  5:05     ` Jerome Glisse
  2016-11-21 11:10     ` Anshuman Khandual
  2016-11-21 10:58   ` Anshuman Khandual
  1 sibling, 2 replies; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  2:06 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler



On 19/11/16 05:18, Jérôme Glisse wrote:
> To allow use of device un-addressable memory inside a process add a
> special swap type. Also add a new callback to handle page fault on
> such entry.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/proc/task_mmu.c       | 10 +++++++-
>  include/linux/memremap.h |  5 ++++
>  include/linux/swap.h     | 18 ++++++++++---
>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c        | 14 ++++++++++
>  mm/Kconfig               | 12 +++++++++
>  mm/memory.c              | 24 +++++++++++++++++
>  mm/mprotect.c            | 12 +++++++++
>  8 files changed, 158 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6909582..0726d39 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>  			} else {
>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
>  			}
> -		} else if (is_migration_entry(swpent))
> +		} else if (is_migration_entry(swpent)) {
>  			page = migration_entry_to_page(swpent);
> +		} else if (is_device_entry(swpent)) {
> +			page = device_entry_to_page(swpent);
> +		}


So is the reason there is a device swap entry for a page belonging to a user process
that the page is in the middle of migration, or is it that such a swap entry always
represents unaddressable memory belonging to a GPU device, just tracked in the page
table entries of the process?

>  	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>  							&& pte_none(*pte))) {
>  		page = find_get_entry(vma->vm_file->f_mapping,
> @@ -708,6 +711,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  
>  		if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		if (is_device_entry(swpent))
> +			page = device_entry_to_page(swpent);
>  	}
>  	if (page) {
>  		int mapcount = page_mapcount(page);
> @@ -1191,6 +1196,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_SWAP;
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
> +
> +		if (is_device_entry(entry))
> +			page = device_entry_to_page(entry);
>  	}
>  
>  	if (page && !PageAnon(page))
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b6f03e9..d584c74 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   */
>  struct dev_pagemap {
>  	void (*free_devpage)(struct page *page, void *data);
> +	int (*fault)(struct vm_area_struct *vma,
> +		     unsigned long addr,
> +		     struct page *page,
> +		     unsigned flags,
> +		     pmd_t *pmdp);
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7e553e1..599cb54 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
>   */
>  
>  /*
> + * Un-addressable device memory support
> + */
> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> +#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
> +#else
> +#define SWP_DEVICE_NUM 0
> +#endif
> +
> +/*
>   * NUMA node memory migration support
>   */
>  #ifdef CONFIG_MIGRATION
> @@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
>  #endif
>  
>  #define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> @@ -442,8 +454,8 @@ static inline void show_swap_cache_info(void)
>  {
>  }
>  
> -#define free_swap_and_cache(swp)	is_migration_entry(swp)
> -#define swapcache_prepare(swp)		is_migration_entry(swp)
> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
>  
>  static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>  {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3..d1aa425 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> +{
> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));

Code style: missing spaces around the ternary operator here, checkpatch will flag it.
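Something like this would keep checkpatch happy:

	return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE,
			 page_to_pfn(page));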

> +}
> +
> +static inline bool is_device_entry(swp_entry_t entry)
> +{
> +	int type = swp_type(entry);
> +	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
> +}
> +
> +static inline void make_device_entry_read(swp_entry_t *entry)
> +{
> +	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
> +}
> +
> +static inline bool is_write_device_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
> +}
> +
> +static inline struct page *device_entry_to_page(swp_entry_t entry)
> +{
> +	return pfn_to_page(swp_offset(entry));
> +}
> +
> +int device_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned flags,
> +		       pmd_t *pmdp);
> +#else /* CONFIG_DEVICE_UNADDRESSABLE */
> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline void make_device_entry_read(swp_entry_t *entry)
> +{
> +}
> +
> +static inline bool is_device_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_write_device_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline struct page *device_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
> +static inline int device_entry_fault(struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     swp_entry_t entry,
> +				     unsigned flags,
> +				     pmd_t *pmdp)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_DEVICE_UNADDRESSABLE */
> +
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index cf83928..0670015 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -18,6 +18,8 @@
>  #include <linux/io.h>
>  #include <linux/mm.h>
>  #include <linux/memory_hotplug.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #ifndef ioremap_cache
>  /* temporary while we convert existing ioremap_cache users to memremap */
> @@ -200,6 +202,18 @@ void put_zone_device_page(struct page *page)
>  }
>  EXPORT_SYMBOL(put_zone_device_page);
>  
> +int device_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned flags,
> +		       pmd_t *pmdp)
> +{
> +	struct page *page = device_entry_to_page(entry);
> +
> +	return page->pgmap->fault(vma, addr, page, flags, pmdp);
> +}
> +EXPORT_SYMBOL(device_entry_fault);
> +
>  static void pgmap_radix_release(struct resource *res)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
> diff --git a/mm/Kconfig b/mm/Kconfig
> index be0ee11..0a21411 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -704,6 +704,18 @@ config ZONE_DEVICE
>  
>  	  If FS_DAX is enabled, then say Y.
>  
> +config DEVICE_UNADDRESSABLE
> +	bool "Un-addressable device memory (GPU memory, ...)"
> +	depends on ZONE_DEVICE
> +
> +	help
> +	  Allow to create struct page for un-addressable device memory
> +	  ie memory that is only accessible by the device (or group of
> +	  devices).
> +
> +	  This allow to migrate chunk of process memory to device memory
> +	  while that memory is use by the device.
> +
>  config FRAME_VECTOR
>  	bool
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 15f2908..a83d690 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -889,6 +889,21 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  					pte = pte_swp_mksoft_dirty(pte);
>  				set_pte_at(src_mm, addr, src_pte, pte);
>  			}
> +		} else if (is_device_entry(entry)) {
> +			page = device_entry_to_page(entry);
> +
> +			get_page(page);
> +			rss[mm_counter(page)]++;

Why does rss count go up?

> +			page_dup_rmap(page, false);
> +
> +			if (is_write_device_entry(entry) &&
> +			    is_cow_mapping(vm_flags)) {
> +				make_device_entry_read(&entry);
> +				pte = swp_entry_to_pte(entry);
> +				if (pte_swp_soft_dirty(*src_pte))
> +					pte = pte_swp_mksoft_dirty(pte);
> +				set_pte_at(src_mm, addr, src_pte, pte);
> +			}
>  		}
>  		goto out_set_pte;
>  	}
> @@ -1191,6 +1206,12 @@ again:
>  
>  			page = migration_entry_to_page(entry);
>  			rss[mm_counter(page)]--;
> +		} else if (is_device_entry(entry)) {
> +			struct page *page = device_entry_to_page(entry);
> +			rss[mm_counter(page)]--;
> +
> +			page_remove_rmap(page, false);
> +			put_page(page);
>  		}
>  		if (unlikely(!free_swap_and_cache(entry)))
>  			print_bad_pte(vma, addr, ptent, NULL);
> @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
>  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
> +		} else if (is_device_entry(entry)) {
> +			ret = device_entry_fault(vma, fe->address, entry,
> +						 fe->flags, fe->pmd);

What does device_entry_fault() actually do here?
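
Judging by the kernel/memremap.c hunk in this patch, it just forwards to
page->pgmap->fault(), so the real work is in the driver. A skeletal, hypothetical
handler (mydev_migrate_back() is made up, and the return convention of 0 on
success / VM_FAULT_* on error is assumed):

static int mydev_devmem_fault(struct vm_area_struct *vma, unsigned long addr,
			      struct page *page, unsigned flags, pmd_t *pmdp)
{
	/*
	 * The CPU touched an un-addressable device page: move the data
	 * back to system memory so the CPU fault can be satisfied.
	 */
	if (mydev_migrate_back(vma, addr, page))
		return VM_FAULT_SIGBUS;

	return 0;
}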

>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
>  		} else {
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1bc1eb3..70aff3a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  				pages++;
>  			}
> +
> +			if (is_write_device_entry(entry)) {
> +				pte_t newpte;
> +
> +				make_device_entry_read(&entry);
> +				newpte = swp_entry_to_pte(entry);
> +				if (pte_swp_soft_dirty(oldpte))
> +					newpte = pte_swp_mksoft_dirty(newpte);
> +				set_pte_at(mm, addr, pte, newpte);
> +
> +				pages++;
> +			}

Does it make sense to call mprotect() on device memory ranges?

>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	arch_leave_lazy_mmu_mode();
> 


* Re: [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2016-11-18 18:18 ` [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
@ 2016-11-21  2:08   ` Balbir Singh
  2016-11-21  5:08     ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  2:08 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Thomas Gleixner, Ingo Molnar, H. Peter Anvin



On 19/11/16 05:18, Jérôme Glisse wrote:
> It does not need much, just skip populating kernel linear mapping
> for range of un-addressable device memory (it is pick so that there
> is no physical memory resource overlapping it). All the logic is in
> share mm code.
> 
> Only support x86-64 as this feature doesn't make much sense with
> constrained virtual address space of 32bits architecture.
> 

Is there a reason this would not work on powerpc64 for example?
Could you document the limitations -- testing/APIs/missing features?

Balbir Singh.


* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
@ 2016-11-21  2:29   ` Balbir Singh
  2016-11-21  5:14     ` Jerome Glisse
  2016-11-23  4:03   ` Anshuman Khandual
  1 sibling, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  2:29 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jatin Kumar, Mark Hairgrove, Sherry Cheung, Subhash Gutti



On 19/11/16 05:18, Jérôme Glisse wrote:
> HMM provides 3 separate functionality :
>     - Mirroring: synchronize CPU page table and device page table
>     - Device memory: allocating struct page for device memory
>     - Migration: migrating regular memory to device memory
> 
> This patch introduces some common helpers and definitions to all of
> those 3 functionality.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  MAINTAINERS              |   7 +++
>  include/linux/hmm.h      | 139 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h |   5 ++
>  kernel/fork.c            |   2 +
>  mm/Kconfig               |  11 ++++
>  mm/Makefile              |   1 +
>  mm/hmm.c                 |  86 +++++++++++++++++++++++++++++
>  7 files changed, 251 insertions(+)
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f593300..41cd63d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5582,6 +5582,13 @@ S:	Supported
>  F:	drivers/scsi/hisi_sas/
>  F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
>  
> +HMM - Heterogeneous Memory Management
> +M:	Jérôme Glisse <jglisse@redhat.com>
> +L:	linux-mm@kvack.org
> +S:	Maintained
> +F:	mm/hmm*
> +F:	include/linux/hmm*
> +
>  HOST AP DRIVER
>  M:	Jouni Malinen <j@w1.fi>
>  L:	hostap@shmoo.com (subscribers-only)
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> new file mode 100644
> index 0000000..54dd529
> --- /dev/null
> +++ b/include/linux/hmm.h
> @@ -0,0 +1,139 @@
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * HMM provides 3 separate functionality :
> + *   - Mirroring: synchronize CPU page table and device page table
> + *   - Device memory: allocating struct page for device memory
> + *   - Migration: migrating regular memory to device memory
> + *
> + * Each can be use independently from the others.
> + *
> + *
> + * Mirroring:
> + *
> + * HMM provide helpers to mirror process address space on a device. For this it
> + * provides several helpers to order device page table update in respect to CPU
> + * page table update. Requirement is that for any given virtual address the CPU
> + * and device page table can not point to different physical page. It uses the
> + * mmu_notifier API and introduce virtual address range lock which block CPU
> + * page table update for a range while the device page table is being updated.
> + * Usage pattern is:
> + *
> + *      hmm_vma_range_lock(vma, start, end);
> + *      // snap shot CPU page table
> + *      // update device page table from snapshot
> + *      hmm_vma_range_unlock(vma, start, end);
> + *
> + * Any CPU page table update that conflict with a range lock will wait until
> + * range is unlock. This garanty proper serialization of CPU and device page
> + * table update.
> + *
> + *
> + * Device memory:
> + *
> + * HMM provides helpers to help leverage device memory either addressable like
> + * regular memory by the CPU or un-addressable at all. In both case the device
> + * memory is associated to dedicated structs page (which are allocated like for
> + * hotplug memory). Device memory management is under the responsability of the
> + * device driver. HMM only allocate and initialize the struct pages associated
> + * with the device memory.
> + *
> + * Allocating struct page for device memory allow to use device memory allmost
> + * like any regular memory. Unlike regular memory it can not be added to the
> + * lru, nor can any memory allocation can use device memory directly. Device
> + * memory will only end up to be use in a process if device driver migrate some
				   in use 
> + * of the process memory from regular memory to device memory.
> + *

A process can never directly allocate device memory?

> + *
> + * Migration:
> + *
> + * Existing memory migration mechanism (mm/migrate.c) does not allow to use
> + * something else than the CPU to copy from source to destination memory. More
> + * over existing code is not tailor to drive migration from process virtual
				tailored
> + * address rather than from list of pages. Finaly the migration flow does not
					      Finally 
> + * allow for graceful failure at different step of the migration process.
> + *
> + * HMM solves all of the above though simple API :
> + *
> + *      hmm_vma_migrate(vma, start, end, ops);
> + *
> + * With ops struct providing 2 callback alloc_and_copy() which allocated the
> + * destination memory and initialize it using source memory. Migration can fail
> + * after this step and thus last callback finalize_and_map() allow the device
> + * driver to know which page were successfully migrated and which were not.
> + *
> + * This can easily be use outside of HMM intended use case.
> + *

I think it is a good API to have.
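
For the record, a rough sketch of how a driver might use it, going by the
hmm_migrate_ops and hmm_vma_migrate() declarations in patch 16 of this series
(the mydev_* helpers are hypothetical):

static void mydev_alloc_and_copy(struct vm_area_struct *vma,
				 unsigned long start, unsigned long end,
				 hmm_pfn_t *pfns, void *private)
{
	/*
	 * Allocate device pages and DMA the source pages into them,
	 * rewriting pfns[] to point at the destination pages. Entries
	 * the driver chooses not to migrate get HMM_PFN_MIGRATE cleared.
	 */
}

static void mydev_finalize_and_map(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end,
				   hmm_pfn_t *pfns, void *private)
{
	/*
	 * Check HMM_PFN_MIGRATE to see which pages actually migrated and
	 * update the device page table accordingly.
	 */
}

static const struct hmm_migrate_ops mydev_migrate_ops = {
	.alloc_and_copy		= mydev_alloc_and_copy,
	.finalize_and_map	= mydev_finalize_and_map,
};

/* pfns[] needs one entry per page in [start, end); per the patch 16
 * comments it first carries the source pfns, then the destination. */
ret = hmm_vma_migrate(&mydev_migrate_ops, vma, start, end, pfns, mydev);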

> + *
> + * This header file contain all the API related to this 3 functionality and
> + * each functions and struct are more thouroughly documented in below comments.
> + */
> +#ifndef LINUX_HMM_H
> +#define LINUX_HMM_H
> +
> +#include <linux/kconfig.h>
> +
> +#if IS_ENABLED(CONFIG_HMM)
> +
> +
> +/*
> + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
		      uses
> + *
> + * Flags:
> + * HMM_PFN_VALID: pfn is valid
> + * HMM_PFN_WRITE: CPU page table have the write permission set
				    has
> + */
> +typedef unsigned long hmm_pfn_t;
> +
> +#define HMM_PFN_VALID (1 << 0)
> +#define HMM_PFN_WRITE (1 << 1)
> +#define HMM_PFN_SHIFT 2
> +
> +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
> +{
> +	if (!(pfn & HMM_PFN_VALID))
> +		return NULL;
> +	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> +}
> +
> +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
> +{
> +	if (!(pfn & HMM_PFN_VALID))
> +		return -1UL;
> +	return (pfn >> HMM_PFN_SHIFT);
> +}
> +

What is hmm_pfn_to_pfn()? I presume it means CPU PFN to device PFN,
or is it the reverse? Please add some comments.
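
A tiny worked example of the encoding, going by the two helpers quoted above:

hmm_pfn_t e = hmm_pfn_from_pfn(0x1234);	/* (0x1234 << HMM_PFN_SHIFT) | HMM_PFN_VALID */

hmm_pfn_to_pfn(e);	/* back to the plain pfn, 0x1234 */
hmm_pfn_to_page(e);	/* pfn_to_page(0x1234) */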

> +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
> +{
> +	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +}
> +
> +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
> +{
> +	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +}
> +

Same as above

> +
> +/* Below are for HMM internal use only ! Not to be use by device driver ! */
> +void hmm_mm_destroy(struct mm_struct *mm);
> +
> +#else /* IS_ENABLED(CONFIG_HMM) */
> +
> +/* Below are for HMM internal use only ! Not to be use by device driver ! */
> +static inline void hmm_mm_destroy(struct mm_struct *mm) {}
> +
> +#endif /* IS_ENABLED(CONFIG_HMM) */
> +#endif /* LINUX_HMM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4a8aced..4effdbf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -23,6 +23,7 @@
>  
>  struct address_space;
>  struct mem_cgroup;
> +struct hmm;
>  
>  #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
>  #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
> @@ -516,6 +517,10 @@ struct mm_struct {
>  	atomic_long_t hugetlb_usage;
>  #endif
>  	struct work_struct async_put_work;
> +#if IS_ENABLED(CONFIG_HMM)
> +	/* HMM need to track few things per mm */
> +	struct hmm *hmm;
> +#endif
>  };
>  
>  static inline void mm_init_cpumask(struct mm_struct *mm)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 690a1aad..af0eec8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -27,6 +27,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/mman.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/hmm.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/vmacache.h>
> @@ -702,6 +703,7 @@ void __mmdrop(struct mm_struct *mm)
>  	BUG_ON(mm == &init_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> +	hmm_mm_destroy(mm);
>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
>  	free_mm(mm);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0a21411..be18cc2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -289,6 +289,17 @@ config MIGRATION
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  	bool
>  
> +config HMM
> +	bool "Heterogeneous memory management (HMM)"
> +	depends on MMU
> +	default n
> +	help
> +	  Heterogeneous memory management, set of helpers for:
> +	    - mirroring of process address space on a device
> +	    - using device memory transparently inside a process
> +
> +	  If unsure, say N to disable HMM.
> +

It would be nice to split this into HMM, HMM_MIGRATE and HMM_MIRROR

>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 2ca1faf..6ac1284 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -76,6 +76,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_HMM) += hmm.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/hmm.c b/mm/hmm.c
> new file mode 100644
> index 0000000..342b596
> --- /dev/null
> +++ b/mm/hmm.c
> @@ -0,0 +1,86 @@
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * Refer to include/linux/hmm.h for informations about heterogeneous memory
> + * management or HMM for short.
> + */
> +#include <linux/mm.h>
> +#include <linux/hmm.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +
> +/*
> + * struct hmm - HMM per mm struct
> + *
> + * @mm: mm struct this HMM struct is bound to
> + */
> +struct hmm {
> +	struct mm_struct	*mm;
> +};
> +
> +/*
> + * hmm_register - register HMM against an mm (HMM internal)
> + *
> + * @mm: mm struct to attach to
> + *
> + * This is not intended to be use directly by device driver but by other HMM
> + * component. It allocates an HMM struct if mm does not have one and initialize
> + * it.
> + */
> +static struct hmm *hmm_register(struct mm_struct *mm)
> +{
> +	struct hmm *hmm = NULL;
> +
> +	if (!mm->hmm) {
> +		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
> +		if (!hmm)
> +			return NULL;
> +		hmm->mm = mm;
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (!mm->hmm)
> +		/*
> +		 * The hmm struct can only be free once mm_struct goes away
> +		 * hence we should always have pre-allocated an new hmm struct
> +		 * above.
> +		 */
> +		mm->hmm = hmm;
> +	else if (hmm)
> +		kfree(hmm);
> +	hmm = mm->hmm;
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return hmm;
> +}
> +
> +void hmm_mm_destroy(struct mm_struct *mm)
> +{
> +	struct hmm *hmm;
> +
> +	/*
> +	 * We should not need to lock here as no one should be able to register
> +	 * a new HMM while an mm is being destroy. But just to be safe ...
> +	 */
> +	spin_lock(&mm->page_table_lock);
> +	hmm = mm->hmm;
> +	mm->hmm = NULL;
> +	spin_unlock(&mm->page_table_lock);
> +	if (!hmm)
> +		return;
> +

kfree() can deal with NULL pointers, so you can remove the if check.
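ie the function could shrink to something like:

void hmm_mm_destroy(struct mm_struct *mm)
{
	struct hmm *hmm;

	spin_lock(&mm->page_table_lock);
	hmm = mm->hmm;
	mm->hmm = NULL;
	spin_unlock(&mm->page_table_lock);

	kfree(hmm);
}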

> +	kfree(hmm);
> +}
> 


* Re: [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2016-11-18 18:18 ` [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2016-11-21  2:42   ` Balbir Singh
  2016-11-21  5:18     ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  2:42 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jatin Kumar, Mark Hairgrove, Sherry Cheung, Subhash Gutti



On 19/11/16 05:18, Jérôme Glisse wrote:
> This is a heterogeneous memory management (HMM) process address space
> mirroring. In a nutshell this provide an API to mirror process address
> space on a device. This boils down to keeping CPU and device page table
> synchronize (we assume that both device and CPU are cache coherent like
> PCIe device can be).
> 
> This patch provide a simple API for device driver to achieve address
> space mirroring thus avoiding each device driver to grow its own CPU
> page table walker and its own CPU page table synchronization mechanism.
> 
> This is usefull for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
	   useful
> hardware in the future.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  include/linux/hmm.h |  97 +++++++++++++++++++++++++++++++
>  mm/hmm.c            | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 257 insertions(+)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 54dd529..f44e270 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -88,6 +88,7 @@
>  
>  #if IS_ENABLED(CONFIG_HMM)
>  
> +struct hmm;
>  
>  /*
>   * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
> @@ -127,6 +128,102 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
>  }
>  
>  
> +/*
> + * Mirroring: how to use synchronize device page table with CPU page table ?
> + *
> + * Device driver must always synchronize with CPU page table update, for this
> + * they can either directly use mmu_notifier API or they can use the hmm_mirror
> + * API. Device driver can decide to register one mirror per device per process
> + * or just one mirror per process for a group of device. Pattern is :
> + *
> + *      int device_bind_address_space(..., struct mm_struct *mm, ...)
> + *      {
> + *          struct device_address_space *das;
> + *          int ret;
> + *          // Device driver specific initialization, and allocation of das
> + *          // which contain an hmm_mirror struct as one of its field.
> + *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
> + *          if (ret) {
> + *              // Cleanup on error
> + *              return ret;
> + *          }
> + *          // Other device driver specific initialization
> + *      }
> + *
> + * Device driver must not free the struct containing hmm_mirror struct before
> + * calling hmm_mirror_unregister() expected usage is to do that when device
> + * driver is unbinding from an address space.
> + *
> + *      void device_unbind_address_space(struct device_address_space *das)
> + *      {
> + *          // Device driver specific cleanup
> + *          hmm_mirror_unregister(&das->mirror);
> + *          // Other device driver specific cleanup and now das can be free
> + *      }
> + *
> + * Once an hmm_mirror is register for an address space, device driver will get
> + * callback through the update() operation (see hmm_mirror_ops struct).
> + */
> +
> +struct hmm_mirror;
> +
> +/*
> + * enum hmm_update - type of update
> + * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
> + */
> +enum hmm_update {
> +	HMM_UPDATE_INVALIDATE,
> +};
> +
> +/*
> + * struct hmm_mirror_ops - HMM mirror device operations callback
> + *
> + * @update: callback to update range on a device
> + */
> +struct hmm_mirror_ops {
> +	/* update() - update virtual address range of memory
> +	 *
> +	 * @mirror: pointer to struct hmm_mirror
> +	 * @update: update's type (turn read only, unmap, ...)
> +	 * @start: virtual start address of the range to update
> +	 * @end: virtual end address of the range to update
> +	 *
> +	 * This callback is call when the CPU page table is updated, the device
> +	 * driver must update device page table accordingly to update's action.
> +	 *
> +	 * Device driver callback must wait until device have fully updated its
> +	 * view for the range. Note we plan to make this asynchronous in later
> +	 * patches. So that multiple devices can schedule update to their page
> +	 * table and once all device have schedule the update then we wait for
> +	 * them to propagate.
> +	 */
> +	void (*update)(struct hmm_mirror *mirror,
> +		       enum hmm_update action,
> +		       unsigned long start,
> +		       unsigned long end);
> +};
> +
> +/*
> + * struct hmm_mirror - mirror struct for a device driver
> + *
> + * @hmm: pointer to struct hmm (which is unique per mm_struct)
> + * @ops: device driver callback for HMM mirror operations
> + * @list: for list of mirrors of a given mm
> + *
> + * Each address space (mm_struct) being mirrored by a device must register one
> + * of hmm_mirror struct with HMM. HMM will track list of all mirrors for each
> + * mm_struct (or each process).
> + */
> +struct hmm_mirror {
> +	struct hmm			*hmm;
> +	const struct hmm_mirror_ops	*ops;
> +	struct list_head		list;
> +};
> +
> +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
> +void hmm_mirror_unregister(struct hmm_mirror *mirror);
> +
> +
>  /* Below are for HMM internal use only ! Not to be use by device driver ! */
>  void hmm_mm_destroy(struct mm_struct *mm);
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 342b596..3594785 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -21,14 +21,27 @@
>  #include <linux/hmm.h>
>  #include <linux/slab.h>
>  #include <linux/sched.h>
> +#include <linux/mmu_notifier.h>
>  
>  /*
>   * struct hmm - HMM per mm struct
>   *
>   * @mm: mm struct this HMM struct is bound to
> + * @lock: lock protecting mirrors list
> + * @mirrors: list of mirrors for this mm
> + * @wait_queue: wait queue
> + * @sequence: we track update to CPU page table with a sequence number
> + * @mmu_notifier: mmu notifier to track update to CPU page table
> + * @notifier_count: number of currently active notifier count
>   */
>  struct hmm {
>  	struct mm_struct	*mm;
> +	spinlock_t		lock;
> +	struct list_head	mirrors;
> +	atomic_t		sequence;
> +	wait_queue_head_t	wait_queue;
> +	struct mmu_notifier	mmu_notifier;
> +	atomic_t		notifier_count;
>  };
>  
>  /*
> @@ -48,6 +61,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
>  		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>  		if (!hmm)
>  			return NULL;
> +		init_waitqueue_head(&hmm->wait_queue);
> +		atomic_set(&hmm->notifier_count, 0);
> +		INIT_LIST_HEAD(&hmm->mirrors);
> +		atomic_set(&hmm->sequence, 0);
> +		hmm->mmu_notifier.ops = NULL;
> +		spin_lock_init(&hmm->lock);
>  		hmm->mm = mm;
>  	}
>  
> @@ -84,3 +103,144 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  
>  	kfree(hmm);
>  }
> +
> +
> +
> +static void hmm_invalidate_range(struct hmm *hmm,
> +				 enum hmm_update action,
> +				 unsigned long start,
> +				 unsigned long end)
> +{
> +	struct hmm_mirror *mirror;
> +
> +	/*
> +	 * Mirror being added or remove is a rare event so list traversal isn't
> +	 * protected by a lock, we rely on simple rules. All list modification
> +	 * are done using list_add_rcu() and list_del_rcu() under a spinlock to
> +	 * protect from concurrent addition or removal but not traversal.
> +	 *
> +	 * Because hmm_mirror_unregister() wait for all running invalidation to
> +	 * complete (and thus all list traversal to finish). None of the mirror
> +	 * struct can be freed from under us while traversing the list and thus
> +	 * it is safe to dereference their list pointer even if they were just
> +	 * remove.
> +	 */
> +	list_for_each_entry (mirror, &hmm->mirrors, list)
> +		mirror->ops->update(mirror, action, start, end);
> +}
> +
> +static void hmm_invalidate_page(struct mmu_notifier *mn,
> +				struct mm_struct *mm,
> +				unsigned long addr)
> +{
> +	unsigned long start = addr & PAGE_MASK;
> +	unsigned long end = start + PAGE_SIZE;
> +	struct hmm *hmm = mm->hmm;
> +
> +	VM_BUG_ON(!hmm);
> +
> +	atomic_inc(&hmm->notifier_count);
> +	atomic_inc(&hmm->sequence);
> +	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
> +	atomic_dec(&hmm->notifier_count);
> +	wake_up(&hmm->wait_queue);
> +}
> +
> +static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +				       struct mm_struct *mm,
> +				       unsigned long start,
> +				       unsigned long end)
> +{
> +	struct hmm *hmm = mm->hmm;
> +
> +	VM_BUG_ON(!hmm);
> +
> +	atomic_inc(&hmm->notifier_count);
> +	atomic_inc(&hmm->sequence);
> +	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
> +}
> +
> +static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> +				     struct mm_struct *mm,
> +				     unsigned long start,
> +				     unsigned long end)
> +{
> +	struct hmm *hmm = mm->hmm;
> +
> +	VM_BUG_ON(!hmm);
> +
> +	/* Reverse order here because we are getting out of invalidation */
> +	atomic_dec(&hmm->notifier_count);
> +	wake_up(&hmm->wait_queue);
> +}
> +
> +static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> +	.invalidate_page	= hmm_invalidate_page,
> +	.invalidate_range_start	= hmm_invalidate_range_start,
> +	.invalidate_range_end	= hmm_invalidate_range_end,
> +};
> +
> +/*
> + * hmm_mirror_register() - register a mirror against an mm
> + *
> + * @mirror: new mirror struct to register
> + * @mm: mm to register against
> + *
> + * To start mirroring a process address space device driver must register an
> + * HMM mirror struct.
> + */
> +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> +{
> +	/* Sanity check */
> +	if (!mm || !mirror || !mirror->ops)
> +		return -EINVAL;
> +
> +	mirror->hmm = hmm_register(mm);
> +	if (!mirror->hmm)
> +		return -ENOMEM;
> +
> +	/* Register mmu_notifier if not already, use mmap_sem for locking */
> +	if (!mirror->hmm->mmu_notifier.ops) {
> +		struct hmm *hmm = mirror->hmm;
> +		down_write(&mm->mmap_sem);
> +		if (!hmm->mmu_notifier.ops) {
> +			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> +			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> +				hmm->mmu_notifier.ops = NULL;
> +				up_write(&mm->mmap_sem);
> +				return -ENOMEM;
> +			}
> +		}
> +		up_write(&mm->mmap_sem);
> +	}

Does everything get mirrored, ie every update to the PTE (clearing the dirty bit,
clearing the accessed bit, etc.), or does the driver decide what to mirror?
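
From the notifier hooks above, every CPU page-table invalidation is forwarded to
each registered mirror through update(), and the hmm_mirror_ops comment says the
callback must update the device page table for the range and wait for the device
before returning. A hypothetical driver-side sketch (the mydev_* names are made up):

static void mydev_update(struct hmm_mirror *mirror, enum hmm_update action,
			 unsigned long start, unsigned long end)
{
	struct mydev_mirror *m = container_of(mirror, struct mydev_mirror,
					      mirror);

	switch (action) {
	case HMM_UPDATE_INVALIDATE:
		/* Unmap [start, end) in the device page table and wait for
		 * the device TLB flush to finish before returning. */
		mydev_invalidate(m, start, end);
		break;
	}
}

static const struct hmm_mirror_ops mydev_mirror_ops = {
	.update = mydev_update,
};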

> +
> +	spin_lock(&mirror->hmm->lock);
> +	list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
> +	spin_unlock(&mirror->hmm->lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_mirror_register);
> +
> +/*
> + * hmm_mirror_unregister() - unregister a mirror
> + *
> + * @mirror: new mirror struct to register
> + *
> + * Stop mirroring a process address space and cleanup.
> + */
> +void hmm_mirror_unregister(struct hmm_mirror *mirror)
> +{
> +	struct hmm *hmm = mirror->hmm;
> +
> +	spin_lock(&hmm->lock);
> +	list_del_rcu(&mirror->list);
> +	spin_unlock(&hmm->lock);
> +
> +	/*
> +	 * Wait for all active notifier so that it is safe to traverse mirror
> +	 * list without any lock.
> +	 */
> +	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
> +}
> +EXPORT_SYMBOL(hmm_mirror_unregister);
> 


* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
  2016-11-18 19:57   ` Aneesh Kumar K.V
  2016-11-19 14:32   ` Aneesh Kumar K.V
@ 2016-11-21  3:30   ` Balbir Singh
  2016-11-21  5:31     ` Jerome Glisse
  2 siblings, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-21  3:30 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jatin Kumar, Mark Hairgrove, Sherry Cheung, Subhash Gutti



On 19/11/16 05:18, Jérôme Glisse wrote:
> This patch add a new memory migration helpers, which migrate memory
             adds                       helper         migrates
> backing a range of virtual address of a process to different memory
> (which can be allocated through special allocator). It differs from
> numa migration by working on a range of virtual address and thus by
> doing migration in chunk that can be large enough to use DMA engine
> or special copy offloading engine.
> 
> Expected users are any one with heterogeneous memory where different
> memory have different characteristics (latency, bandwidth, ...). As
> an example IBM platform with CAPI bus can make use of this feature
> to migrate between regular memory and CAPI device memory. New CPU
> architecture with a pool of high performance memory not manage as
> cache but presented as regular memory (while being faster and with
> lower latency than DDR) will also be prime user of this patch.
> 
> Migration to private device memory will be usefull for device that
> have large pool of such like GPU, NVidia plans to use HMM for that.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  include/linux/hmm.h |  54 ++++-
>  mm/migrate.c        | 584 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 635 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index c79abfc..9777309 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -101,10 +101,13 @@ struct hmm;
>   * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
>   * HMM_PFN_FAULT: use by hmm_vma_fault() to signify which address need faulting
>   * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
> + * HMM_PFN_LOCKED: underlying struct page is lock
>   * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
>   *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
>   *      device (the entry will never have HMM_PFN_VALID set and the pfn value
>   *      is undefine)
> + * HMM_PFN_MIGRATE: use by hmm_vma_migrate() to signify which address can be
> + *      migrated
>   * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
>   */
>  typedef unsigned long hmm_pfn_t;
> @@ -116,9 +119,11 @@ typedef unsigned long hmm_pfn_t;
>  #define HMM_PFN_EMPTY (1 << 4)
>  #define HMM_PFN_FAULT (1 << 5)
>  #define HMM_PFN_DEVICE (1 << 6)
> -#define HMM_PFN_SPECIAL (1 << 7)
> -#define HMM_PFN_UNADDRESSABLE (1 << 8)
> -#define HMM_PFN_SHIFT 9
> +#define HMM_PFN_LOCKED (1 << 7)
> +#define HMM_PFN_SPECIAL (1 << 8)
> +#define HMM_PFN_MIGRATE (1 << 9)
> +#define HMM_PFN_UNADDRESSABLE (1 << 10)
> +#define HMM_PFN_SHIFT 11
>  
>  static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
>  {
> @@ -323,6 +328,49 @@ bool hmm_vma_fault(struct vm_area_struct *vma,
>  		   hmm_pfn_t *pfns);
>  
>  
> +/*
> + * struct hmm_migrate_ops - migrate operation callback
> + *
> + * @alloc_and_copy: alloc destination memoiry and copy source to it
> + * @finalize_and_map: allow caller to inspect successfull migrated page
> + *
> + * The new HMM migrate helper hmm_vma_migrate() allow memory migration to use
> + * device DMA engine to perform copy from source to destination memory it also
> + * allow caller to use its own memory allocator for destination memory.
> + *
> + * Note that in alloc_and_copy device driver can decide not to migrate some of
> + * the entry, for those it must clear the HMM_PFN_MIGRATE flag. The destination
> + * page must lock and the corresponding hmm_pfn_t value in the array updated
> + * with the HMM_PFN_MIGRATE and HMM_PFN_LOCKED flag set (and of course be a
> + * valid entry). It is expected that the page allocated will have an elevated
> + * refcount and that a put_page() will free the page. Device driver might want
> + * to allocate with an extra-refcount if they want to control deallocation of
> + * failed migration inside the finalize_and_map() callback.
> + *
> + * Inside finalize_and_map() device driver must use the HMM_PFN_MIGRATE flag to
> + * determine which page have been successfully migrated.
> + */
> +struct hmm_migrate_ops {
> +	void (*alloc_and_copy)(struct vm_area_struct *vma,
> +			       unsigned long start,
> +			       unsigned long end,
> +			       hmm_pfn_t *pfns,
> +			       void *private);
> +	void (*finalize_and_map)(struct vm_area_struct *vma,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 hmm_pfn_t *pfns,
> +				 void *private);
> +};
> +
> +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> +		    struct vm_area_struct *vma,
> +		    unsigned long start,
> +		    unsigned long end,
> +		    hmm_pfn_t *pfns,
> +		    void *private);
> +
> +
>  /* Below are for HMM internal use only ! Not to be use by device driver ! */
>  void hmm_mm_destroy(struct mm_struct *mm);
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index d9ce8db..393d592 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -41,6 +41,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/page_owner.h>
>  #include <linux/memremap.h>
> +#include <linux/hmm.h>
>  
>  #include <asm/tlbflush.h>
>  
> @@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
>  	int expected_count = 1 + extra_count;
>  	void **pslot;
>  
> +	/*
> +	 * ZONE_DEVICE pages have 1 refcount always held by their device
> +	 *
> +	 * Note that DAX memory will never reach that point as it does not have
> +	 * the MEMORY_MOVABLE flag set (see include/linux/memory_hotplug.h).
> +	 */
> +	expected_count += is_zone_device_page(page);
> +
>  	if (!mapping) {
>  		/* Anonymous page without mapping */
>  		if (page_count(page) != expected_count)
> @@ -2087,3 +2096,578 @@ out_unlock:
>  #endif /* CONFIG_NUMA_BALANCING */
>  
>  #endif /* CONFIG_NUMA */
> +
> +
> +#if defined(CONFIG_HMM)
> +struct hmm_migrate {
> +	struct vm_area_struct	*vma;
> +	unsigned long		start;
> +	unsigned long		end;
> +	unsigned long		npages;
> +	hmm_pfn_t		*pfns;

I presume pfns[] is the destination, or is it the source?

> +};
> +
> +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> +				unsigned long start,
> +				unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	struct hmm_migrate *migrate = walk->private;
> +	struct mm_struct *mm = walk->vma->vm_mm;
> +	unsigned long addr = start;
> +	spinlock_t *ptl;
> +	hmm_pfn_t *pfns;
> +	int pages = 0;
> +	pte_t *ptep;
> +
> +again:
> +	if (pmd_none(*pmdp))
> +		return 0;
> +
> +	split_huge_pmd(walk->vma, pmdp, addr);
> +	if (pmd_trans_unstable(pmdp))
> +		goto again;
> +

OK, so we always split THPs before migration.


> +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +
> +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> +		unsigned long pfn;
> +		swp_entry_t entry;
> +		struct page *page;
> +		hmm_pfn_t flags;
> +		bool write;
> +		pte_t pte;
> +
> +		pte = ptep_get_and_clear(mm, addr, ptep);
> +		if (!pte_present(pte)) {
> +			if (pte_none(pte))
> +				continue;
> +
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_device_entry(entry)) {
> +				set_pte_at(mm, addr, ptep, pte);

Why hard code this? In general, the ability to migrate a VMA
start/end range seems like a useful API.

> +				continue;
> +			}
> +
> +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;

Currently UNADDRESSABLE?

> +			page = device_entry_to_page(entry);
> +			write = is_write_device_entry(entry);
> +			pfn = page_to_pfn(page);
> +
> +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> +				set_pte_at(mm, addr, ptep, pte);
> +				continue;
> +			}
> +
> +		} else {
> +			pfn = pte_pfn(pte);
> +			page = pfn_to_page(pfn);
> +			write = pte_write(pte);
> +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
> +		}
> +
> +		/* FIXME support THP see hmm_migrate_page_check() */
> +		if (PageTransCompound(page))
> +			continue;

Didn't we split the THP above?

> +
> +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> +		*pfns |= write ? HMM_PFN_WRITE : 0;
> +		migrate->npages++;
> +		get_page(page);
> +
> +		if (!trylock_page(page)) {
> +			set_pte_at(mm, addr, ptep, pte);

put_page()?

> +		} else {
> +			pte_t swp_pte;
> +
> +			*pfns |= HMM_PFN_LOCKED;
> +
> +			entry = make_migration_entry(page, write);
> +			swp_pte = swp_entry_to_pte(entry);
> +			if (pte_soft_dirty(pte))
> +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +			set_pte_at(mm, addr, ptep, swp_pte);
> +
> +			page_remove_rmap(page, false);
> +			put_page(page);
> +			pages++;
> +		}
> +	}
> +
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(ptep - 1, ptl);
> +
> +	/* Only flush the TLB if we actually modified any entries */
> +	if (pages)
> +		flush_tlb_range(walk->vma, start, end);
> +
> +	return 0;
> +}
> +
> +static void hmm_migrate_collect(struct hmm_migrate *migrate)
> +{
> +	struct mm_walk mm_walk;
> +
> +	mm_walk.pmd_entry = hmm_collect_walk_pmd;
> +	mm_walk.pte_entry = NULL;
> +	mm_walk.pte_hole = NULL;
> +	mm_walk.hugetlb_entry = NULL;
> +	mm_walk.test_walk = NULL;
> +	mm_walk.vma = migrate->vma;
> +	mm_walk.mm = migrate->vma->vm_mm;
> +	mm_walk.private = migrate;
> +
> +	mmu_notifier_invalidate_range_start(mm_walk.mm,
> +					    migrate->start,
> +					    migrate->end);
> +	walk_page_range(migrate->start, migrate->end, &mm_walk);
> +	mmu_notifier_invalidate_range_end(mm_walk.mm,
> +					  migrate->start,
> +					  migrate->end);
> +}
> +
> +static inline bool hmm_migrate_page_check(struct page *page, int extra)
> +{
> +	/*
> +	 * FIXME support THP (transparent huge page), it is bit more complex to
> +	 * check them then regular page because they can be map with a pmd or
> +	 * with a pte (split pte mapping).
> +	 */
> +	if (PageCompound(page))
> +		return false;

PageTransCompound()?

> +
> +	if (is_zone_device_page(page))
> +		extra++;
> +
> +	if ((page_count(page) - extra) > page_mapcount(page))
> +		return false;
> +
> +	return true;
> +}
> +
> +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +	struct vm_area_struct *vma = migrate->vma;
> +	unsigned long restore = 0;
> +	bool allow_drain = true;
> +
> +	lru_add_drain();
> +
> +again:
> +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page)
> +			continue;
> +
> +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
> +			lock_page(page);
> +			migrate->pfns[i] |= HMM_PFN_LOCKED;
> +		}
> +
> +		/* ZONE_DEVICE page are not on LRU */
> +		if (is_zone_device_page(page))
> +			goto check;
> +
> +		if (!PageLRU(page) && allow_drain) {
> +			/* Drain CPU's pagevec so page can be isolated */
> +			lru_add_drain_all();
> +			allow_drain = false;
> +			goto again;
> +		}
> +
> +		if (isolate_lru_page(page)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			put_page(page);
> +			restore++;
> +		} else
> +			/* Drop the reference we took in collect */
> +			put_page(page);
> +
> +check:
> +		if (!hmm_migrate_page_check(page, 1)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			restore++;
> +		}
> +	}
> +
> +	if (!restore)
> +		return;
> +
> +	for (addr = migrate->start, i = 0; addr < migrate->end;) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +		unsigned long next, restart;
> +		spinlock_t *ptl;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE)) {
> +			addr += PAGE_SIZE;
> +			i++;
> +			continue;
> +		}
> +
> +		restart = addr;
> +		pgdp = pgd_offset(mm, addr);
> +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pudp = pud_offset(pgdp, addr);
> +		if (!pudp || pud_none(*pudp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> +			addr = next;
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			swp_entry_t entry;
> +			bool write;
> +			pte_t pte;
> +
> +			page = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +
> +			write = migrate->pfns[i] & HMM_PFN_WRITE;
> +			write &= (vma->vm_flags & VM_WRITE);
> +
> +			/* Here it means pte must be a valid migration entry */
> +			pte = ptep_get_and_clear(mm, addr, ptep);
> +			if (pte_none(pte) || pte_present(pte))
> +				/* SOMETHING BAD IS GOING ON ! */
> +				continue;
> +			entry = pte_to_swp_entry(pte);
> +			if (!is_migration_entry(entry))
> +				/* SOMETHING BAD IS GOING ON ! */
> +				continue;
> +
> +			if (is_zone_device_page(page) &&
> +			    !is_addressable_page(page)) {
> +				entry = make_device_entry(page, write);
> +				pte = swp_entry_to_pte(entry);
> +			} else {
> +				pte = mk_pte(page, vma->vm_page_prot);
> +				pte = pte_mkold(pte);
> +				if (write)
> +					pte = pte_mkwrite(pte);
> +			}
> +			if (pte_swp_soft_dirty(*ptep))
> +				pte = pte_mksoft_dirty(pte);
> +
> +			get_page(page);
> +			set_pte_at(mm, addr, ptep, pte);
> +			if (PageAnon(page))
> +				page_add_anon_rmap(page, vma, addr, false);
> +			else
> +				page_add_file_rmap(page, false);

Why do we do the rmap bits here?


> +		}
> +		pte_unmap_unlock(ptep - 1, ptl);
> +
> +		addr = restart;
> +		i = (addr - migrate->start) >> PAGE_SHIFT;
> +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
> +			page = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +
> +			migrate->pfns[i] = 0;
> +			unlock_page(page);
> +			restore--;
> +
> +			if (is_zone_device_page(page)) {
> +				put_page(page);
> +				continue;
> +			}
> +
> +			putback_lru_page(page);
> +		}
> +
> +		if (!restore)
> +			break;
> +	}
> +}
> +
> +static void hmm_migrate_unmap(struct hmm_migrate *migrate)
> +{
> +	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> +	unsigned long addr = migrate->start, i = 0, restore = 0;
> +
> +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> +			continue;
> +
> +		try_to_unmap(page, flags);
> +		if (page_mapped(page) || !hmm_migrate_page_check(page, 1)) {
> +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +			migrate->npages--;
> +			restore++;
> +		}
> +	}
> +
> +	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
> +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> +
> +		if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> +			continue;
> +
> +		remove_migration_ptes(page, page, false);
> +
> +		migrate->pfns[i] = 0;
> +		unlock_page(page);
> +		restore--;
> +
> +		if (is_zone_device_page(page)) {
> +			put_page(page);
> +			continue;
> +		}
> +
> +		putback_lru_page(page);
> +	}
> +}
> +
> +static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +
> +	for (; addr < migrate->end;) {
> +		unsigned long next;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		pgdp = pgd_offset(mm, addr);
> +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pudp = pud_offset(pgdp, addr);
> +		if (!pudp || pud_none(*pudp)) {
> +			addr = pgd_addr_end(addr, migrate->end);
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> +			addr = next;
> +			i = (addr - migrate->start) >> PAGE_SHIFT;
> +			continue;
> +		}
> +
> +		/* No need to lock nothing can change from under us */
> +		ptep = pte_offset_map(pmdp, addr);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			struct address_space *mapping;
> +			struct page *newpage, *page;
> +			swp_entry_t entry;
> +			int r;
> +
> +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!newpage || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> +				continue;
> +			if (pte_none(*ptep) || pte_present(*ptep)) {
> +				/* This should not happen but be nice */
> +				migrate->pfns[i] = 0;
> +				put_page(newpage);
> +				continue;
> +			}
> +			entry = pte_to_swp_entry(*ptep);
> +			if (!is_migration_entry(entry)) {
> +				/* This should not happen but be nice */
> +				migrate->pfns[i] = 0;
> +				put_page(newpage);
> +				continue;
> +			}
> +
> +			page = migration_entry_to_page(entry);
> +			mapping = page_mapping(page);
> +
> +			/*
> +			 * For now only support private anonymous when migrating
> +			 * to un-addressable device memory.

I thought HMM supported page cache migration as well.

> +			 */
> +			if (mapping && is_zone_device_page(newpage) &&
> +			    !is_addressable_page(newpage)) {
> +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +				continue;
> +			}
> +
> +			r = migrate_page(mapping, newpage, page,
> +					 MIGRATE_SYNC, false);
> +			if (r != MIGRATEPAGE_SUCCESS)
> +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> +		}
> +		pte_unmap(ptep - 1);
> +	}
> +}
> +
> +static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
> +{
> +	unsigned long addr = migrate->start, i = 0;
> +	struct mm_struct *mm = migrate->vma->vm_mm;
> +
> +	for (; addr < migrate->end;) {
> +		unsigned long next;
> +		pgd_t *pgdp;
> +		pud_t *pudp;
> +		pmd_t *pmdp;
> +		pte_t *ptep;
> +
> +		pgdp = pgd_offset(mm, addr);
> +		pudp = pud_offset(pgdp, addr);
> +		pmdp = pmd_offset(pudp, addr);
> +		next = pmd_addr_end(addr, migrate->end);
> +
> +		/* No need to lock nothing can change from under us */
> +		ptep = pte_offset_map(pmdp, addr);
> +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> +			struct page *page, *newpage;
> +			swp_entry_t entry;
> +
> +			if (pte_none(*ptep) || pte_present(*ptep))
> +				continue;
> +			entry = pte_to_swp_entry(*ptep);
> +			if (!is_migration_entry(entry))
> +				continue;
> +
> +			page = migration_entry_to_page(entry);
> +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> +			if (!newpage)
> +				newpage = page;
> +			remove_migration_ptes(page, newpage, false);
> +
> +			migrate->pfns[i] = 0;
> +			unlock_page(page);
> +			migrate->npages--;
> +
> +			if (is_zone_device_page(page))
> +				put_page(page);
> +			else
> +				putback_lru_page(page);
> +
> +			if (newpage != page) {
> +				unlock_page(newpage);
> +				if (is_zone_device_page(newpage))
> +					put_page(newpage);
> +				else
> +					putback_lru_page(newpage);
> +			}
> +		}
> +		pte_unmap(ptep - 1);
> +	}
> +}
> +
> +/*
> + * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
> + *
> + * @ops: migration callback for allocating destination memory and copying
> + * @vma: virtual memory area containing the range to be migrated
> + * @start: start address of the range to migrate (inclusive)
> + * @end: end address of the range to migrate (exclusive)
> + * @pfns: array of hmm_pfn_t first containing source pfns then destination
> + * @private: pointer passed back to each of the callback
> + * Returns: 0 on success, error code otherwise
> + *
> + * This will try to migrate a range of memory using callback to allocate and
> + * copy memory from source to destination. This function will first collect,
> + * lock and unmap pages in the range and then call alloc_and_copy() callback
> + * for device driver to allocate destination memory and copy from source.
> + *
> + * Then it will proceed and try to effectively migrate the page (struct page
> + * metadata) a step that can fail for various reasons. Before updating CPU page
> + * table it will call finalize_and_map() callback so that device driver can
> + * inspect what have been successfully migrated and update its own page table
> + * (this latter aspect is not mandatory and only make sense for some user of
> + * this API).
> + *
> + * Finaly the function update CPU page table and unlock the pages before
> + * returning 0.
> + *
> + * It will return an error code only if one of the argument is invalid.
> + */
> +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> +		    struct vm_area_struct *vma,
> +		    unsigned long start,
> +		    unsigned long end,
> +		    hmm_pfn_t *pfns,
> +		    void *private)
> +{
> +	struct hmm_migrate migrate;
> +
> +	/* Sanity check the arguments */
> +	start &= PAGE_MASK;
> +	end &= PAGE_MASK;
> +	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> +		return -EINVAL;
> +	if (!vma || !ops || !pfns || start >= end)
> +		return -EINVAL;
> +	if (start < vma->vm_start || start >= vma->vm_end)
> +		return -EINVAL;
> +	if (end <= vma->vm_start || end > vma->vm_end)
> +		return -EINVAL;
> +
> +	migrate.start = start;
> +	migrate.pfns = pfns;
> +	migrate.npages = 0;
> +	migrate.end = end;
> +	migrate.vma = vma;
> +
> +	/* Collect, and try to unmap source pages */
> +	hmm_migrate_collect(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/* Lock and isolate page */
> +	hmm_migrate_lock_and_isolate(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/* Unmap pages */
> +	hmm_migrate_unmap(&migrate);
> +	if (!migrate.npages)
> +		return 0;
> +
> +	/*
> +	 * At this point pages are lock and unmap and thus they have stable
> +	 * content and can safely be copied to destination memory that is
> +	 * allocated by the callback.
> +	 *
> +	 * Note that migration can fail in hmm_migrate_struct_page() for each
> +	 * individual page.
> +	 */
> +	ops->alloc_and_copy(vma, start, end, pfns, private);

What is the expectation from alloc_and_copy()? Can it fail?

> +
> +	/* This does the real migration of struct page */
> +	hmm_migrate_struct_page(&migrate);
> +
> +	ops->finalize_and_map(vma, start, end, pfns, private);

Is this just notification to the driver or more?

> +
> +	/* Unlock and remap pages */
> +	hmm_migrate_remove_migration_pte(&migrate);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(hmm_vma_migrate);
> +#endif /* CONFIG_HMM */
> 

Balbir Singh

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-21  0:44   ` Balbir Singh
@ 2016-11-21  4:53     ` Jerome Glisse
  2016-11-21  6:57       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  4:53 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Mon, Nov 21, 2016 at 11:44:36AM +1100, Balbir Singh wrote:
> 
> 
> On 19/11/16 05:18, Jérôme Glisse wrote:
> > Only usefull for arch where we support ZONE_DEVICE and where we want to
> > also support un-addressable device memory. We need struct page for such
> > un-addressable memory. But we should avoid populating the kernel linear
> > mapping for the physical address range because there is no real memory
> > or anything behind those physical address.
> > 
> > Hence we need more flags than just knowing if it is device memory or not.
> > 
> 
> 
> Isn't it better to add a wrapper to arch_add/remove_memory and do those
> checks inside and then call arch_add/remove_memory to reduce the churn.
> If you need selectively enable MEMORY_UNADDRESSABLE that can be done with
> _ARCH_HAS_FEATURE

The flag parameter can be used by other new features and thus I thought the
churn was fine. But I do not mind either way, whatever people like best.

[...]

> > -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> > +
> > +/*
> > + * For device memory we want more informations than just knowing it is device
> 				     information
> > + * memory. We want to know if we can migrate it (ie it is not storage memory
> > + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> > + * memory can not be access by CPU but we still want struct page so that we
> 			accessed
> > + * can use it like regular memory.
> 
> Can you please add some details on why -- migration needs them for example?

I am not sure what you mean? DAX, i.e. a persistent memory device, is intended
to be used for filesystems or persistent storage. Hence memory migration does
not apply to it (it would go against its purpose).

So I want to extend ZONE_DEVICE to be more than just DAX/persistent memory. For
that I need to differentiate device memory that can be migrated, and should be
more or less treated like regular memory (with struct page), from device memory
that can not. This is what the MEMORY_MOVABLE flag is for.

Finally, in my case the device memory is not accessible by the CPU, so I need
yet another flag. In the end I am extending ZONE_DEVICE to be used for 3
different types of memory.

Is this the kind of explanation you are looking for?

> > + */
> > +#define MEMORY_FLAGS_NONE 0
> > +#define MEMORY_DEVICE (1 << 0)
> > +#define MEMORY_MOVABLE (1 << 1)
> > +#define MEMORY_UNADDRESSABLE (1 << 2)
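
For example (untested sketch, the resource handling is made up), the three
kinds of ZONE_DEVICE memory above would then be hot-plugged as:

    /* DAX/persistent memory: device memory that must not be migrated */
    arch_add_memory(nid, res->start, resource_size(res), MEMORY_DEVICE);

    /* CPU addressable device memory that is treated like regular memory */
    arch_add_memory(nid, res->start, resource_size(res),
                    MEMORY_DEVICE | MEMORY_MOVABLE);

    /* device memory the CPU can not access, no kernel linear mapping */
    arch_add_memory(nid, res->start, resource_size(res),
                    MEMORY_DEVICE | MEMORY_MOVABLE | MEMORY_UNADDRESSABLE);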

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-21  1:49   ` Balbir Singh
@ 2016-11-21  4:57     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  4:57 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 12:49:55PM +1100, Balbir Singh wrote:
> On 19/11/16 05:18, Jérôme Glisse wrote:
> > When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
> > is holding a reference on it (only device to which the memory belong do).
> > Add a callback and call it when that happen so device driver can implement
> > their own free page management.
> > 
> 
> Could you give an example of what their own free page management might look like?

Well, it is hard to do that: the free page management is whatever the device
driver wants to do, so I don't have any example to give. Each device driver
(especially GPU ones) has its own memory management with little commonality.

So how the device driver manages that memory is really not important; at least
it is not something for which I want to impose a policy on drivers. I want to
leave each device driver to decide how to achieve that.
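
That said, just to illustrate what I mean (completely hypothetical driver
structures, and the callback signature is only a sketch, not what the patch
mandates), a GPU driver could do something as simple as:

    static void my_gpu_free_devpage(struct page *page, void *data)
    {
        struct my_gpu *gpu = data;
        unsigned long idx = page_to_pfn(page) - gpu->first_pfn;

        /* give the backing VRAM block back to the driver's allocator */
        spin_lock(&gpu->vram_lock);
        bitmap_clear(gpu->vram_bitmap, idx, 1);
        spin_unlock(&gpu->vram_lock);
    }

Another driver might instead put the page on a free list, wake an eviction
thread, etc. That is exactly the policy I do not want to dictate.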

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-21  2:06   ` Balbir Singh
@ 2016-11-21  5:05     ` Jerome Glisse
  2016-11-22  2:19       ` Balbir Singh
  2016-11-21 11:10     ` Anshuman Khandual
  1 sibling, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  5:05 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 01:06:45PM +1100, Balbir Singh wrote:
> 
> 
> On 19/11/16 05:18, Jérôme Glisse wrote:
> > To allow use of device un-addressable memory inside a process add a
> > special swap type. Also add a new callback to handle page fault on
> > such entry.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/proc/task_mmu.c       | 10 +++++++-
> >  include/linux/memremap.h |  5 ++++
> >  include/linux/swap.h     | 18 ++++++++++---
> >  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  kernel/memremap.c        | 14 ++++++++++
> >  mm/Kconfig               | 12 +++++++++
> >  mm/memory.c              | 24 +++++++++++++++++
> >  mm/mprotect.c            | 12 +++++++++
> >  8 files changed, 158 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 6909582..0726d39 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> >  			} else {
> >  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
> >  			}
> > -		} else if (is_migration_entry(swpent))
> > +		} else if (is_migration_entry(swpent)) {
> >  			page = migration_entry_to_page(swpent);
> > +		} else if (is_device_entry(swpent)) {
> > +			page = device_entry_to_page(swpent);
> > +		}
> 
> 
> So the reason there is a device swap entry for a page belonging to a user process is
> that it is in the middle of migration or is it always that a swap entry represents
> unaddressable memory belonging to a GPU device, but its tracked in the page table
> entries of the process.

For a page being migrated I use the existing special migration pte entry. This
new device special swap entry is only for unaddressable memory belonging to a
device (GPU or anything else). We need to keep track of those inside the CPU
page table. Using a new special swap entry is the easiest way with the minimum
amount of change to core mm.

[...]

> > +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> > +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> > +{
> > +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> 
> Code style checks

I was trying to balance that against the 79 column line length rule :)

[...]

> > +		} else if (is_device_entry(entry)) {
> > +			page = device_entry_to_page(entry);
> > +
> > +			get_page(page);
> > +			rss[mm_counter(page)]++;
> 
> Why does rss count go up?

I wanted the device page to be treated like any other page. There are arguments
to be made both for and against doing that. Do you have a strong argument for
not doing this?

[...]

> > @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
> >  	if (unlikely(non_swap_entry(entry))) {
> >  		if (is_migration_entry(entry)) {
> >  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
> > +		} else if (is_device_entry(entry)) {
> > +			ret = device_entry_fault(vma, fe->address, entry,
> > +						 fe->flags, fe->pmd);
> 
> What does device_entry_fault() actually do here?

Well, it is a special fault handler: it must migrate the memory back to some
place where the CPU can access it. It only matters for unaddressable memory.
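
Roughly, the driver side of that handler ends up using the migrate helper from
patch 16 to move the page back to system memory, something like this
(simplified sketch, the my_gpu_* names are made up):

    static int my_gpu_devmem_fault(struct vm_area_struct *vma,
                                   unsigned long addr)
    {
        unsigned long start = addr & PAGE_MASK;
        hmm_pfn_t pfn = 0;

        /* migrate the faulting page back to regular system memory */
        if (hmm_vma_migrate(&my_gpu_migrate_back_ops, vma,
                            start, start + PAGE_SIZE, &pfn, NULL))
            return VM_FAULT_SIGBUS;
        return 0;
    }

where my_gpu_migrate_back_ops->alloc_and_copy() allocates a regular page and
uses the device DMA engine to copy the data out of the device memory.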

> >  		} else if (is_hwpoison_entry(entry)) {
> >  			ret = VM_FAULT_HWPOISON;
> >  		} else {
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 1bc1eb3..70aff3a 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  
> >  				pages++;
> >  			}
> > +
> > +			if (is_write_device_entry(entry)) {
> > +				pte_t newpte;
> > +
> > +				make_device_entry_read(&entry);
> > +				newpte = swp_entry_to_pte(entry);
> > +				if (pte_swp_soft_dirty(oldpte))
> > +					newpte = pte_swp_mksoft_dirty(newpte);
> > +				set_pte_at(mm, addr, pte, newpte);
> > +
> > +				pages++;
> > +			}
> 
> Does it make sense to call mprotect() on device memory ranges?

There is nothing special about a vma that contains device memory. It can be
private anonymous, shared, file backed ... So any existing memory syscall must
behave as expected. These pages are really just like any other pages except
that the CPU can not access them.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2016-11-21  2:08   ` Balbir Singh
@ 2016-11-21  5:08     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  5:08 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

On Mon, Nov 21, 2016 at 01:08:56PM +1100, Balbir Singh wrote:
> 
> 
> On 19/11/16 05:18, Jérôme Glisse wrote:
> > It does not need much, just skip populating kernel linear mapping
> > for range of un-addressable device memory (it is pick so that there
> > is no physical memory resource overlapping it). All the logic is in
> > share mm code.
> > 
> > Only support x86-64 as this feature doesn't make much sense with
> > constrained virtual address space of 32bits architecture.
> > 
> 
> Is there a reason this would not work on powerpc64 for example?
> Could you document the limitations -- testing/APIs/missing features?

It should be straightforward for powerpc64. I haven't done it, but I can
certainly try to get access to some powerpc64 hardware and add support for it.

The only thing to do is to avoid creating the kernel linear mapping for the
un-addressable memory (just for safety reasons: we do not want any read or
write to an invalid physical address).
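
Concretely, the arch hook boils down to something like this (rough sketch of
what the x86-64 patch does; a powerpc64 version would guard its linear/hash
mapping setup the same way):

    int arch_add_memory(int nid, u64 start, u64 size, int flags)
    {
        unsigned long start_pfn = start >> PAGE_SHIFT;
        unsigned long nr_pages = size >> PAGE_SHIFT;
        struct zone *zone;

        /* no kernel linear mapping for memory the CPU can not access */
        if (!(flags & MEMORY_UNADDRESSABLE))
            init_memory_mapping(start, start + size);

        zone = NODE_DATA(nid)->node_zones +
                zone_for_memory(nid, start, size, ZONE_NORMAL,
                                flags & MEMORY_DEVICE);
        return __add_pages(nid, zone, start_pfn, nr_pages);
    }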

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-21  2:29   ` Balbir Singh
@ 2016-11-21  5:14     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  5:14 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, Nov 21, 2016 at 01:29:23PM +1100, Balbir Singh wrote:
> On 19/11/16 05:18, Jérôme Glisse wrote:
> > HMM provides 3 separate functionality :
> >     - Mirroring: synchronize CPU page table and device page table
> >     - Device memory: allocating struct page for device memory
> >     - Migration: migrating regular memory to device memory
> > 
> > This patch introduces some common helpers and definitions to all of
> > those 3 functionality.
> > 

[...]

> > +/*
> > + * HMM provides 3 separate functionality :
> > + *   - Mirroring: synchronize CPU page table and device page table
> > + *   - Device memory: allocating struct page for device memory
> > + *   - Migration: migrating regular memory to device memory
> > + *
> > + * Each can be use independently from the others.
> > + *
> > + *
> > + * Mirroring:
> > + *
> > + * HMM provide helpers to mirror process address space on a device. For this it
> > + * provides several helpers to order device page table update in respect to CPU
> > + * page table update. Requirement is that for any given virtual address the CPU
> > + * and device page table can not point to different physical page. It uses the
> > + * mmu_notifier API and introduce virtual address range lock which block CPU
> > + * page table update for a range while the device page table is being updated.
> > + * Usage pattern is:
> > + *
> > + *      hmm_vma_range_lock(vma, start, end);
> > + *      // snap shot CPU page table
> > + *      // update device page table from snapshot
> > + *      hmm_vma_range_unlock(vma, start, end);
> > + *
> > + * Any CPU page table update that conflict with a range lock will wait until
> > + * range is unlock. This garanty proper serialization of CPU and device page
> > + * table update.
> > + *
> > + *
> > + * Device memory:
> > + *
> > + * HMM provides helpers to help leverage device memory either addressable like
> > + * regular memory by the CPU or un-addressable at all. In both case the device
> > + * memory is associated to dedicated structs page (which are allocated like for
> > + * hotplug memory). Device memory management is under the responsability of the
> > + * device driver. HMM only allocate and initialize the struct pages associated
> > + * with the device memory.
> > + *
> > + * Allocating struct page for device memory allow to use device memory allmost
> > + * like any regular memory. Unlike regular memory it can not be added to the
> > + * lru, nor can any memory allocation can use device memory directly. Device
> > + * memory will only end up to be use in a process if device driver migrate some
> 				   in use 
> > + * of the process memory from regular memory to device memory.
> > + *
> 
> A process can never directly allocate device memory?

Well, yes and no. If the device driver is the first to trigger a page fault on
some memory then it can decide to directly allocate device memory. But a usual
CPU page fault will not trigger allocation of device memory. A new mechanism
could be added to achieve that if it makes sense, but for my main target
(x86/PCIe) it does not.

> > + *
> > + * Migration:
> > + *
> > + * Existing memory migration mechanism (mm/migrate.c) does not allow to use
> > + * something else than the CPU to copy from source to destination memory. More
> > + * over existing code is not tailor to drive migration from process virtual
> 				tailored
> > + * address rather than from list of pages. Finaly the migration flow does not
> 					      Finally 
> > + * allow for graceful failure at different step of the migration process.
> > + *
> > + * HMM solves all of the above though simple API :
> > + *
> > + *      hmm_vma_migrate(vma, start, end, ops);
> > + *
> > + * With ops struct providing 2 callback alloc_and_copy() which allocated the
> > + * destination memory and initialize it using source memory. Migration can fail
> > + * after this step and thus last callback finalize_and_map() allow the device
> > + * driver to know which page were successfully migrated and which were not.
> > + *
> > + * This can easily be use outside of HMM intended use case.
> > + *
> 
> I think it is a good API to have
> 
> > + *
> > + * This header file contain all the API related to this 3 functionality and
> > + * each functions and struct are more thouroughly documented in below comments.
> > + */
> > +#ifndef LINUX_HMM_H
> > +#define LINUX_HMM_H
> > +
> > +#include <linux/kconfig.h>
> > +
> > +#if IS_ENABLED(CONFIG_HMM)
> > +
> > +
> > +/*
> > + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
> 		      uses
> > + *
> > + * Flags:
> > + * HMM_PFN_VALID: pfn is valid
> > + * HMM_PFN_WRITE: CPU page table have the write permission set
> 				    has
> > + */
> > +typedef unsigned long hmm_pfn_t;
> > +
> > +#define HMM_PFN_VALID (1 << 0)
> > +#define HMM_PFN_WRITE (1 << 1)
> > +#define HMM_PFN_SHIFT 2
> > +
> > +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
> > +{
> > +	if (!(pfn & HMM_PFN_VALID))
> > +		return NULL;
> > +	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> > +}
> > +
> > +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
> > +{
> > +	if (!(pfn & HMM_PFN_VALID))
> > +		return -1UL;
> > +	return (pfn >> HMM_PFN_SHIFT);
> > +}
> > +
> 
> What is pfn_to_pfn? I presume it means CPU PFN to device PFN
> or is it the reverse? Please add some comments

It converts an hmm_pfn_t to a pfn value as an unsigned long. The memory the pfn
points to can be anything (regular system memory, device memory, ...).

hmm_pfn_t is just a pfn with a set of flags packed in the low bits.
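
In other words, with the helpers quoted above:

    hmm_pfn_t e = hmm_pfn_from_page(page) | HMM_PFN_WRITE;

    struct page *p = hmm_pfn_to_page(e);    /* back to the same struct page */
    unsigned long raw = hmm_pfn_to_pfn(e);  /* plain pfn as an unsigned long */
    bool writable = e & HMM_PFN_WRITE;      /* the flags live in the low bits */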

> 
> > +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
> > +{
> > +	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> > +}
> > +
> > +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
> > +{
> > +	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> > +}
> > +
> 
> Same as above
> 
> > +
> > +/* Below are for HMM internal use only ! Not to be use by device driver ! */
> > +void hmm_mm_destroy(struct mm_struct *mm);
> > +
> > +#else /* IS_ENABLED(CONFIG_HMM) */
> > +
> > +/* Below are for HMM internal use only ! Not to be use by device driver ! */
> > +static inline void hmm_mm_destroy(struct mm_struct *mm) {}
> > +
> > +#endif /* IS_ENABLED(CONFIG_HMM) */
> > +#endif /* LINUX_HMM_H */
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 4a8aced..4effdbf 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -23,6 +23,7 @@
> >  
> >  struct address_space;
> >  struct mem_cgroup;
> > +struct hmm;
> >  
> >  #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
> >  #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
> > @@ -516,6 +517,10 @@ struct mm_struct {
> >  	atomic_long_t hugetlb_usage;
> >  #endif
> >  	struct work_struct async_put_work;
> > +#if IS_ENABLED(CONFIG_HMM)
> > +	/* HMM need to track few things per mm */
> > +	struct hmm *hmm;
> > +#endif
> >  };
> >  
> >  static inline void mm_init_cpumask(struct mm_struct *mm)
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 690a1aad..af0eec8 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -27,6 +27,7 @@
> >  #include <linux/binfmts.h>
> >  #include <linux/mman.h>
> >  #include <linux/mmu_notifier.h>
> > +#include <linux/hmm.h>
> >  #include <linux/fs.h>
> >  #include <linux/mm.h>
> >  #include <linux/vmacache.h>
> > @@ -702,6 +703,7 @@ void __mmdrop(struct mm_struct *mm)
> >  	BUG_ON(mm == &init_mm);
> >  	mm_free_pgd(mm);
> >  	destroy_context(mm);
> > +	hmm_mm_destroy(mm);
> >  	mmu_notifier_mm_destroy(mm);
> >  	check_mm(mm);
> >  	free_mm(mm);
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 0a21411..be18cc2 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -289,6 +289,17 @@ config MIGRATION
> >  config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >  	bool
> >  
> > +config HMM
> > +	bool "Heterogeneous memory management (HMM)"
> > +	depends on MMU
> > +	default n
> > +	help
> > +	  Heterogeneous memory management, set of helpers for:
> > +	    - mirroring of process address space on a device
> > +	    - using device memory transparently inside a process
> > +
> > +	  If unsure, say N to disable HMM.
> > +
> 
> It would be nice to split this into HMM, HMM_MIGRATE and HMM_MIRROR
> 
> >  config PHYS_ADDR_T_64BIT
> >  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> >  
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 2ca1faf..6ac1284 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -76,6 +76,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> >  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> >  obj-$(CONFIG_MEMTEST)		+= memtest.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_HMM) += hmm.o
> >  obj-$(CONFIG_QUICKLIST) += quicklist.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > new file mode 100644
> > index 0000000..342b596
> > --- /dev/null
> > +++ b/mm/hmm.c
> > @@ -0,0 +1,86 @@
> > +/*
> > + * Copyright 2013 Red Hat Inc.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * Authors: Jérôme Glisse <jglisse@redhat.com>
> > + */
> > +/*
> > + * Refer to include/linux/hmm.h for informations about heterogeneous memory
> > + * management or HMM for short.
> > + */
> > +#include <linux/mm.h>
> > +#include <linux/hmm.h>
> > +#include <linux/slab.h>
> > +#include <linux/sched.h>
> > +
> > +/*
> > + * struct hmm - HMM per mm struct
> > + *
> > + * @mm: mm struct this HMM struct is bound to
> > + */
> > +struct hmm {
> > +	struct mm_struct	*mm;
> > +};
> > +
> > +/*
> > + * hmm_register - register HMM against an mm (HMM internal)
> > + *
> > + * @mm: mm struct to attach to
> > + *
> > + * This is not intended to be use directly by device driver but by other HMM
> > + * component. It allocates an HMM struct if mm does not have one and initialize
> > + * it.
> > + */
> > +static struct hmm *hmm_register(struct mm_struct *mm)
> > +{
> > +	struct hmm *hmm = NULL;
> > +
> > +	if (!mm->hmm) {
> > +		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
> > +		if (!hmm)
> > +			return NULL;
> > +		hmm->mm = mm;
> > +	}
> > +
> > +	spin_lock(&mm->page_table_lock);
> > +	if (!mm->hmm)
> > +		/*
> > +		 * The hmm struct can only be free once mm_struct goes away
> > +		 * hence we should always have pre-allocated an new hmm struct
> > +		 * above.
> > +		 */
> > +		mm->hmm = hmm;
> > +	else if (hmm)
> > +		kfree(hmm);
> > +	hmm = mm->hmm;
> > +	spin_unlock(&mm->page_table_lock);
> > +
> > +	return hmm;
> > +}
> > +
> > +void hmm_mm_destroy(struct mm_struct *mm)
> > +{
> > +	struct hmm *hmm;
> > +
> > +	/*
> > +	 * We should not need to lock here as no one should be able to register
> > +	 * a new HMM while an mm is being destroy. But just to be safe ...
> > +	 */
> > +	spin_lock(&mm->page_table_lock);
> > +	hmm = mm->hmm;
> > +	mm->hmm = NULL;
> > +	spin_unlock(&mm->page_table_lock);
> > +	if (!hmm)
> > +		return;
> > +
> 
> kfree can deal with NULL pointers, you can remove the if check

Yeah.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2016-11-21  2:42   ` Balbir Singh
@ 2016-11-21  5:18     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  5:18 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, Nov 21, 2016 at 01:42:43PM +1100, Balbir Singh wrote:
> On 19/11/16 05:18, Jérôme Glisse wrote:

[...]

> > +/*
> > + * hmm_mirror_register() - register a mirror against an mm
> > + *
> > + * @mirror: new mirror struct to register
> > + * @mm: mm to register against
> > + *
> > + * To start mirroring a process address space device driver must register an
> > + * HMM mirror struct.
> > + */
> > +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> > +{
> > +	/* Sanity check */
> > +	if (!mm || !mirror || !mirror->ops)
> > +		return -EINVAL;
> > +
> > +	mirror->hmm = hmm_register(mm);
> > +	if (!mirror->hmm)
> > +		return -ENOMEM;
> > +
> > +	/* Register mmu_notifier if not already, use mmap_sem for locking */
> > +	if (!mirror->hmm->mmu_notifier.ops) {
> > +		struct hmm *hmm = mirror->hmm;
> > +		down_write(&mm->mmap_sem);
> > +		if (!hmm->mmu_notifier.ops) {
> > +			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> > +			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> > +				hmm->mmu_notifier.ops = NULL;
> > +				up_write(&mm->mmap_sem);
> > +				return -ENOMEM;
> > +			}
> > +		}
> > +		up_write(&mm->mmap_sem);
> > +	}
> 
> Does everything get mirrored, every update to the PTE (clear dirty, clear
> accessed bit, etc) or does the driver decide?

The driver decides, but only read/write/valid matter for the device. The device
driver must report dirtiness on invalidation. Some devices do not have an
accessed bit and thus can't provide that information.

The idea here is really to snapshot the CPU page table and duplicate it as a
GPU page table. The only guarantee HMM provides is that each virtual address
points to the same memory, i.e. at no point in time can the same virtual
address point to different physical memory on the device and on the CPU.
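
So the driver side is basically (sketch, the snapshot and device update steps
are of course device specific):

    hmm_mirror_register(&mirror, current->mm);

    /* on a device fault or prefetch for [start, end) */
    hmm_vma_range_lock(vma, start, end);
    /* snapshot the CPU page table into an array of hmm_pfn_t */
    /* program the GPU page table from that snapshot */
    hmm_vma_range_unlock(vma, start, end);

with the driver's mmu_notifier callbacks invalidating the GPU page table
whenever the CPU page table changes, and reporting dirtiness back at that
point.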

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory
  2016-11-21  3:30   ` Balbir Singh
@ 2016-11-21  5:31     ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21  5:31 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Mon, Nov 21, 2016 at 02:30:46PM +1100, Balbir Singh wrote:
> On 19/11/16 05:18, Jérôme Glisse wrote:

[...]

> > +
> > +
> > +#if defined(CONFIG_HMM)
> > +struct hmm_migrate {
> > +	struct vm_area_struct	*vma;
> > +	unsigned long		start;
> > +	unsigned long		end;
> > +	unsigned long		npages;
> > +	hmm_pfn_t		*pfns;
> 
> I presume the destination is pfns[] or is the source?

Both. When alloc_and_copy() is called the array is filled with the source pfns,
but once that callback returns it must have set the destination pfns inside
that same array. This is what I discussed with Aneesh in this thread.
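
So a driver's alloc_and_copy() looks roughly like this (sketch, the my_gpu_*
names are made up):

    static void my_gpu_alloc_and_copy(struct vm_area_struct *vma,
                                      unsigned long start,
                                      unsigned long end,
                                      hmm_pfn_t *pfns,
                                      void *private)
    {
        unsigned long addr, i;

        for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
            struct page *spage = hmm_pfn_to_page(pfns[i]);
            struct page *dpage;

            if (!spage || !(pfns[i] & HMM_PFN_MIGRATE))
                continue;

            dpage = my_gpu_alloc_page(private);
            if (!dpage) {
                /* failure is per page, this one stays where it is */
                pfns[i] &= ~HMM_PFN_MIGRATE;
                continue;
            }

            /* copy source to device memory with the device DMA engine */
            my_gpu_copy_page(private, dpage, spage);

            /* replace the source pfn with the destination pfn */
            pfns[i] = hmm_pfn_from_page(dpage) | HMM_PFN_MIGRATE |
                      (pfns[i] & HMM_PFN_WRITE);
        }
    }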

> > +};
> > +
> > +static int hmm_collect_walk_pmd(pmd_t *pmdp,
> > +				unsigned long start,
> > +				unsigned long end,
> > +				struct mm_walk *walk)
> > +{
> > +	struct hmm_migrate *migrate = walk->private;
> > +	struct mm_struct *mm = walk->vma->vm_mm;
> > +	unsigned long addr = start;
> > +	spinlock_t *ptl;
> > +	hmm_pfn_t *pfns;
> > +	int pages = 0;
> > +	pte_t *ptep;
> > +
> > +again:
> > +	if (pmd_none(*pmdp))
> > +		return 0;
> > +
> > +	split_huge_pmd(walk->vma, pmdp, addr);
> > +	if (pmd_trans_unstable(pmdp))
> > +		goto again;
> > +
> 
> OK., so we always split THP before migration

Yes, because I need special swap entries and those do not exist at the pmd level.

> > +	pfns = &migrate->pfns[(addr - migrate->start) >> PAGE_SHIFT];
> > +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +
> > +	for (; addr < end; addr += PAGE_SIZE, pfns++, ptep++) {
> > +		unsigned long pfn;
> > +		swp_entry_t entry;
> > +		struct page *page;
> > +		hmm_pfn_t flags;
> > +		bool write;
> > +		pte_t pte;
> > +
> > +		pte = ptep_get_and_clear(mm, addr, ptep);
> > +		if (!pte_present(pte)) {
> > +			if (pte_none(pte))
> > +				continue;
> > +
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_device_entry(entry)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> 
> Why hard code this, in general the ability to migrate a VMA
> start/end range seems like a useful API.

Some memory can not be migrated: we can not migrate something that is already
being migrated, something that is swapped out, or something that is bad
memory ... I only try to migrate valid memory.

> > +				continue;
> > +			}
> > +
> > +			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
> 
> Currently UNADDRESSABLE?

Yes, this is a special device swap entry and thus it is unaddressable memory.
The destination memory might also be unaddressable (migrating from one device
to another device).


> > +			page = device_entry_to_page(entry);
> > +			write = is_write_device_entry(entry);
> > +			pfn = page_to_pfn(page);
> > +
> > +			if (!(page->pgmap->flags & MEMORY_MOVABLE)) {
> > +				set_pte_at(mm, addr, ptep, pte);
> > +				continue;
> > +			}
> > +
> > +		} else {
> > +			pfn = pte_pfn(pte);
> > +			page = pfn_to_page(pfn);
> > +			write = pte_write(pte);
> > +			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
> > +		}
> > +
> > +		/* FIXME support THP see hmm_migrate_page_check() */
> > +		if (PageTransCompound(page))
> > +			continue;
> 
> Didn't we split the THP above?

We split the huge pmd, not the huge page. The intention is to support huge
pages, but I wanted to keep the patch simple, and THP needs special handling
when it comes to refcounts to check for pins (either on the huge page or on
one of its tail pages).

> 
> > +
> > +		*pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
> > +		*pfns |= write ? HMM_PFN_WRITE : 0;
> > +		migrate->npages++;
> > +		get_page(page);
> > +
> > +		if (!trylock_page(page)) {
> > +			set_pte_at(mm, addr, ptep, pte);
> 
> put_page()?

No, we will try later to lock the page and thus we want to keep a ref on the page.

> > +		} else {
> > +			pte_t swp_pte;
> > +
> > +			*pfns |= HMM_PFN_LOCKED;
> > +
> > +			entry = make_migration_entry(page, write);
> > +			swp_pte = swp_entry_to_pte(entry);
> > +			if (pte_soft_dirty(pte))
> > +				swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +			set_pte_at(mm, addr, ptep, swp_pte);
> > +
> > +			page_remove_rmap(page, false);
> > +			put_page(page);
> > +			pages++;
> > +		}
> > +	}
> > +
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(ptep - 1, ptl);
> > +
> > +	/* Only flush the TLB if we actually modified any entries */
> > +	if (pages)
> > +		flush_tlb_range(walk->vma, start, end);
> > +
> > +	return 0;
> > +}
> > +
> > +static void hmm_migrate_collect(struct hmm_migrate *migrate)
> > +{
> > +	struct mm_walk mm_walk;
> > +
> > +	mm_walk.pmd_entry = hmm_collect_walk_pmd;
> > +	mm_walk.pte_entry = NULL;
> > +	mm_walk.pte_hole = NULL;
> > +	mm_walk.hugetlb_entry = NULL;
> > +	mm_walk.test_walk = NULL;
> > +	mm_walk.vma = migrate->vma;
> > +	mm_walk.mm = migrate->vma->vm_mm;
> > +	mm_walk.private = migrate;
> > +
> > +	mmu_notifier_invalidate_range_start(mm_walk.mm,
> > +					    migrate->start,
> > +					    migrate->end);
> > +	walk_page_range(migrate->start, migrate->end, &mm_walk);
> > +	mmu_notifier_invalidate_range_end(mm_walk.mm,
> > +					  migrate->start,
> > +					  migrate->end);
> > +}
> > +
> > +static inline bool hmm_migrate_page_check(struct page *page, int extra)
> > +{
> > +	/*
> > +	 * FIXME support THP (transparent huge page), it is bit more complex to
> > +	 * check them then regular page because they can be map with a pmd or
> > +	 * with a pte (split pte mapping).
> > +	 */
> > +	if (PageCompound(page))
> > +		return false;
> 
> PageTransCompound()?

Yes, right now I think it is equivalent on all architectures.


> > +
> > +	if (is_zone_device_page(page))
> > +		extra++;
> > +
> > +	if ((page_count(page) - extra) > page_mapcount(page))
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +	struct vm_area_struct *vma = migrate->vma;
> > +	unsigned long restore = 0;
> > +	bool allow_drain = true;
> > +
> > +	lru_add_drain();
> > +
> > +again:
> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page)
> > +			continue;
> > +
> > +		if (!(migrate->pfns[i] & HMM_PFN_LOCKED)) {
> > +			lock_page(page);
> > +			migrate->pfns[i] |= HMM_PFN_LOCKED;
> > +		}
> > +
> > +		/* ZONE_DEVICE page are not on LRU */
> > +		if (is_zone_device_page(page))
> > +			goto check;
> > +
> > +		if (!PageLRU(page) && allow_drain) {
> > +			/* Drain CPU's pagevec so page can be isolated */
> > +			lru_add_drain_all();
> > +			allow_drain = false;
> > +			goto again;
> > +		}
> > +
> > +		if (isolate_lru_page(page)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			put_page(page);
> > +			restore++;
> > +		} else
> > +			/* Drop the reference we took in collect */
> > +			put_page(page);
> > +
> > +check:
> > +		if (!hmm_migrate_page_check(page, 1)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			restore++;
> > +		}
> > +	}
> > +
> > +	if (!restore)
> > +		return;
> > +
> > +	for (addr = migrate->start, i = 0; addr < migrate->end;) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +		unsigned long next, restart;
> > +		spinlock_t *ptl;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE)) {
> > +			addr += PAGE_SIZE;
> > +			i++;
> > +			continue;
> > +		}
> > +
> > +		restart = addr;
> > +		pgdp = pgd_offset(mm, addr);
> > +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pudp = pud_offset(pgdp, addr);
> > +		if (!pudp || pud_none(*pudp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> > +			addr = next;
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			swp_entry_t entry;
> > +			bool write;
> > +			pte_t pte;
> > +
> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +
> > +			write = migrate->pfns[i] & HMM_PFN_WRITE;
> > +			write &= (vma->vm_flags & VM_WRITE);
> > +
> > +			/* Here it means pte must be a valid migration entry */
> > +			pte = ptep_get_and_clear(mm, addr, ptep);
> > +			if (pte_none(pte) || pte_present(pte))
> > +				/* SOMETHING BAD IS GOING ON ! */
> > +				continue;
> > +			entry = pte_to_swp_entry(pte);
> > +			if (!is_migration_entry(entry))
> > +				/* SOMETHING BAD IS GOING ON ! */
> > +				continue;
> > +
> > +			if (is_zone_device_page(page) &&
> > +			    !is_addressable_page(page)) {
> > +				entry = make_device_entry(page, write);
> > +				pte = swp_entry_to_pte(entry);
> > +			} else {
> > +				pte = mk_pte(page, vma->vm_page_prot);
> > +				pte = pte_mkold(pte);
> > +				if (write)
> > +					pte = pte_mkwrite(pte);
> > +			}
> > +			if (pte_swp_soft_dirty(*ptep))
> > +				pte = pte_mksoft_dirty(pte);
> > +
> > +			get_page(page);
> > +			set_pte_at(mm, addr, ptep, pte);
> > +			if (PageAnon(page))
> > +				page_add_anon_rmap(page, vma, addr, false);
> > +			else
> > +				page_add_file_rmap(page, false);
> 
> Why do we do the rmap bits here?

We did page_remove_rmap() in hmm_migrate_collect(), so we need to restore the
rmap here.


> > +		}
> > +		pte_unmap_unlock(ptep - 1, ptl);
> > +
> > +		addr = restart;
> > +		i = (addr - migrate->start) >> PAGE_SHIFT;
> > +		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
> > +			page = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +
> > +			migrate->pfns[i] = 0;
> > +			unlock_page(page);
> > +			restore--;
> > +
> > +			if (is_zone_device_page(page)) {
> > +				put_page(page);
> > +				continue;
> > +			}
> > +
> > +			putback_lru_page(page);
> > +		}
> > +
> > +		if (!restore)
> > +			break;
> > +	}
> > +}
> > +
> > +static void hmm_migrate_unmap(struct hmm_migrate *migrate)
> > +{
> > +	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
> > +	unsigned long addr = migrate->start, i = 0, restore = 0;
> > +
> > +	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +			continue;
> > +
> > +		try_to_unmap(page, flags);
> > +		if (page_mapped(page) || !hmm_migrate_page_check(page, 1)) {
> > +			migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +			migrate->npages--;
> > +			restore++;
> > +		}
> > +	}
> > +
> > +	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
> > +		struct page *page = hmm_pfn_to_page(migrate->pfns[i]);
> > +
> > +		if (!page || (migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +			continue;
> > +
> > +		remove_migration_ptes(page, page, false);
> > +
> > +		migrate->pfns[i] = 0;
> > +		unlock_page(page);
> > +		restore--;
> > +
> > +		if (is_zone_device_page(page)) {
> > +			put_page(page);
> > +			continue;
> > +		}
> > +
> > +		putback_lru_page(page);
> > +	}
> > +}
> > +
> > +static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +
> > +	for (; addr < migrate->end;) {
> > +		unsigned long next;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		pgdp = pgd_offset(mm, addr);
> > +		if (!pgdp || pgd_none_or_clear_bad(pgdp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pudp = pud_offset(pgdp, addr);
> > +		if (!pudp || pud_none(*pudp)) {
> > +			addr = pgd_addr_end(addr, migrate->end);
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp)) {
> > +			addr = next;
> > +			i = (addr - migrate->start) >> PAGE_SHIFT;
> > +			continue;
> > +		}
> > +
> > +		/* No need to lock nothing can change from under us */
> > +		ptep = pte_offset_map(pmdp, addr);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			struct address_space *mapping;
> > +			struct page *newpage, *page;
> > +			swp_entry_t entry;
> > +			int r;
> > +
> > +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!newpage || !(migrate->pfns[i] & HMM_PFN_MIGRATE))
> > +				continue;
> > +			if (pte_none(*ptep) || pte_present(*ptep)) {
> > +				/* This should not happen but be nice */
> > +				migrate->pfns[i] = 0;
> > +				put_page(newpage);
> > +				continue;
> > +			}
> > +			entry = pte_to_swp_entry(*ptep);
> > +			if (!is_migration_entry(entry)) {
> > +				/* This should not happen but be nice */
> > +				migrate->pfns[i] = 0;
> > +				put_page(newpage);
> > +				continue;
> > +			}
> > +
> > +			page = migration_entry_to_page(entry);
> > +			mapping = page_mapping(page);
> > +
> > +			/*
> > +			 * For now only support private anonymous when migrating
> > +			 * to un-addressable device memory.
> 
> I thought HMM supported page cache migration as well.

Not for un-addressable memory. Un-addressable memory needs more changes to the
filesystem code to handle read/write and writeback. That will be part of a
separate patchset.


> > +			 */
> > +			if (mapping && is_zone_device_page(newpage) &&
> > +			    !is_addressable_page(newpage)) {
> > +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +				continue;
> > +			}
> > +
> > +			r = migrate_page(mapping, newpage, page,
> > +					 MIGRATE_SYNC, false);
> > +			if (r != MIGRATEPAGE_SUCCESS)
> > +				migrate->pfns[i] &= ~HMM_PFN_MIGRATE;
> > +		}
> > +		pte_unmap(ptep - 1);
> > +	}
> > +}
> > +
> > +static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
> > +{
> > +	unsigned long addr = migrate->start, i = 0;
> > +	struct mm_struct *mm = migrate->vma->vm_mm;
> > +
> > +	for (; addr < migrate->end;) {
> > +		unsigned long next;
> > +		pgd_t *pgdp;
> > +		pud_t *pudp;
> > +		pmd_t *pmdp;
> > +		pte_t *ptep;
> > +
> > +		pgdp = pgd_offset(mm, addr);
> > +		pudp = pud_offset(pgdp, addr);
> > +		pmdp = pmd_offset(pudp, addr);
> > +		next = pmd_addr_end(addr, migrate->end);
> > +
> > +		/* No need to lock nothing can change from under us */
> > +		ptep = pte_offset_map(pmdp, addr);
> > +		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
> > +			struct page *page, *newpage;
> > +			swp_entry_t entry;
> > +
> > +			if (pte_none(*ptep) || pte_present(*ptep))
> > +				continue;
> > +			entry = pte_to_swp_entry(*ptep);
> > +			if (!is_migration_entry(entry))
> > +				continue;
> > +
> > +			page = migration_entry_to_page(entry);
> > +			newpage = hmm_pfn_to_page(migrate->pfns[i]);
> > +			if (!newpage)
> > +				newpage = page;
> > +			remove_migration_ptes(page, newpage, false);
> > +
> > +			migrate->pfns[i] = 0;
> > +			unlock_page(page);
> > +			migrate->npages--;
> > +
> > +			if (is_zone_device_page(page))
> > +				put_page(page);
> > +			else
> > +				putback_lru_page(page);
> > +
> > +			if (newpage != page) {
> > +				unlock_page(newpage);
> > +				if (is_zone_device_page(newpage))
> > +					put_page(newpage);
> > +				else
> > +					putback_lru_page(newpage);
> > +			}
> > +		}
> > +		pte_unmap(ptep - 1);
> > +	}
> > +}
> > +
> > +/*
> > + * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
> > + *
> > + * @ops: migration callback for allocating destination memory and copying
> > + * @vma: virtual memory area containing the range to be migrated
> > + * @start: start address of the range to migrate (inclusive)
> > + * @end: end address of the range to migrate (exclusive)
> > + * @pfns: array of hmm_pfn_t first containing source pfns then destination
> > + * @private: pointer passed back to each of the callback
> > + * Returns: 0 on success, error code otherwise
> > + *
> > + * This will try to migrate a range of memory using callback to allocate and
> > + * copy memory from source to destination. This function will first collect,
> > + * lock and unmap pages in the range and then call alloc_and_copy() callback
> > + * for device driver to allocate destination memory and copy from source.
> > + *
> > + * Then it will proceed and try to effectively migrate the page (struct page
> > + * metadata) a step that can fail for various reasons. Before updating CPU page
> > + * table it will call finalize_and_map() callback so that device driver can
> > + * inspect what have been successfully migrated and update its own page table
> > + * (this latter aspect is not mandatory and only make sense for some user of
> > + * this API).
> > + *
> > + * Finaly the function update CPU page table and unlock the pages before
> > + * returning 0.
> > + *
> > + * It will return an error code only if one of the argument is invalid.
> > + */
> > +int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
> > +		    struct vm_area_struct *vma,
> > +		    unsigned long start,
> > +		    unsigned long end,
> > +		    hmm_pfn_t *pfns,
> > +		    void *private)
> > +{
> > +	struct hmm_migrate migrate;
> > +
> > +	/* Sanity check the arguments */
> > +	start &= PAGE_MASK;
> > +	end &= PAGE_MASK;
> > +	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
> > +		return -EINVAL;
> > +	if (!vma || !ops || !pfns || start >= end)
> > +		return -EINVAL;
> > +	if (start < vma->vm_start || start >= vma->vm_end)
> > +		return -EINVAL;
> > +	if (end <= vma->vm_start || end > vma->vm_end)
> > +		return -EINVAL;
> > +
> > +	migrate.start = start;
> > +	migrate.pfns = pfns;
> > +	migrate.npages = 0;
> > +	migrate.end = end;
> > +	migrate.vma = vma;
> > +
> > +	/* Collect, and try to unmap source pages */
> > +	hmm_migrate_collect(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/* Lock and isolate page */
> > +	hmm_migrate_lock_and_isolate(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/* Unmap pages */
> > +	hmm_migrate_unmap(&migrate);
> > +	if (!migrate.npages)
> > +		return 0;
> > +
> > +	/*
> > +	 * At this point pages are lock and unmap and thus they have stable
> > +	 * content and can safely be copied to destination memory that is
> > +	 * allocated by the callback.
> > +	 *
> > +	 * Note that migration can fail in hmm_migrate_struct_page() for each
> > +	 * individual page.
> > +	 */
> > +	ops->alloc_and_copy(vma, start, end, pfns, private);
> 
> What is the expectation from alloc_and_copy()? Can it fail?

It can fail, but there is no global status; it is all handled on an individual
page basis. So for instance, if a device can only allocate its device memory in
chunks of 64 pages, then it can migrate any chunk that matches this constraint
and fail for anything smaller than that.
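
Concretely, for such a device, alloc_and_copy() would just clear
HMM_PFN_MIGRATE for the pages it can not place, e.g. (sketch,
my_gpu_alloc_chunk() is made up and fills in the destination pfns on success):

    unsigned long i, j, npages = (end - start) >> PAGE_SHIFT;

    for (i = 0; i < npages; i += 64) {
        unsigned long n = min(64UL, npages - i);

        /* hypothetical allocator that only hands out 64 page chunks */
        if (my_gpu_alloc_chunk(private, &pfns[i], n))
            continue;

        /* could not allocate destination: those pages stay where they are */
        for (j = i; j < i + n; j++)
            pfns[j] &= ~HMM_PFN_MIGRATE;
    }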


> > +
> > +	/* This does the real migration of struct page */
> > +	hmm_migrate_struct_page(&migrate);
> > +
> > +	ops->finalize_and_map(vma, start, end, pfns, private);
> 
> Is this just notification to the driver or more?

Just a notification to the driver.
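
The driver can simply walk the pfns array at that point and see what actually
made it, e.g. (sketch, my_gpu_map_page() is made up):

    static void my_gpu_finalize_and_map(struct vm_area_struct *vma,
                                        unsigned long start,
                                        unsigned long end,
                                        hmm_pfn_t *pfns,
                                        void *private)
    {
        unsigned long addr, i;

        for (addr = start, i = 0; addr < end; addr += PAGE_SIZE, i++) {
            /* entries that lost HMM_PFN_MIGRATE failed to migrate */
            if (!(pfns[i] & HMM_PFN_MIGRATE))
                continue;
            /* this page did migrate, point the GPU page table at it */
            my_gpu_map_page(private, addr, hmm_pfn_to_pfn(pfns[i]));
        }
    }

HMM itself does not look at anything the callback does.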

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
  2016-11-21  0:44   ` Balbir Singh
@ 2016-11-21  6:41   ` Anshuman Khandual
  2016-11-21 12:27     ` Jerome Glisse
  1 sibling, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21  6:41 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Russell King, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Martin Schwidefsky,
	Heiko Carstens, Yoshinori Sato, Rich Felker, Chris Metcalf,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> Only usefull for arch where we support ZONE_DEVICE and where we want to

A small nit s/usefull/useful/

> also support un-addressable device memory. We need struct page for such
> un-addressable memory. But we should avoid populating the kernel linear
> mapping for the physical address range because there is no real memory
> or anything behind those physical address.
> 
> Hence we need more flags than just knowing if it is device memory or not.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Chris Metcalf <cmetcalf@mellanox.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> ---
>  arch/ia64/mm/init.c            | 19 ++++++++++++++++---
>  arch/powerpc/mm/mem.c          | 18 +++++++++++++++---
>  arch/s390/mm/init.c            | 10 ++++++++--
>  arch/sh/mm/init.c              | 18 +++++++++++++++---
>  arch/tile/mm/init.c            | 10 ++++++++--
>  arch/x86/mm/init_32.c          | 19 ++++++++++++++++---
>  arch/x86/mm/init_64.c          | 19 ++++++++++++++++---
>  include/linux/memory_hotplug.h | 17 +++++++++++++++--
>  kernel/memremap.c              |  4 ++--
>  mm/memory_hotplug.c            |  4 ++--
>  10 files changed, 113 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 1841ef6..95a2fa5 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -645,7 +645,7 @@ mem_init (void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	pg_data_t *pgdat;
>  	struct zone *zone;
> @@ -653,10 +653,17 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> +		zone_for_memory(nid, start, size, ZONE_NORMAL,
> +				flags & MEMORY_DEVICE);
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
>  
>  	if (ret)
> @@ -667,13 +674,19 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (ret)
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 5f84433..e3c0532 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -126,7 +126,7 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
>  	return -ENODEV;
>  }
>  
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata;
>  	struct zone *zone;
> @@ -134,6 +134,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int rc;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdata = NODE_DATA(nid);
>  
>  	start = (unsigned long)__va(start);
> @@ -147,18 +153,24 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  
>  	/* this should work for most non-highmem platforms */
>  	zone = pgdata->node_zones +
> -		zone_for_memory(nid, start, size, 0, for_device);
> +		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
>  
>  	return __add_pages(nid, zone, start_pfn, nr_pages);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
> +	
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
>  
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index f56a39b..4147b87 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	unsigned long normal_end_pfn = PFN_DOWN(memblock_end_of_DRAM());
>  	unsigned long dma_end_pfn = PFN_DOWN(MAX_DMA_ADDRESS);
> @@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
>  	unsigned long nr_pages;
>  	int rc, zone_enum;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	rc = vmem_add_mapping(start, size);
>  	if (rc)
>  		return rc;
> @@ -197,7 +203,7 @@ unsigned long memory_block_size_bytes(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	/*
>  	 * There is no hardware or firmware interface which could trigger a
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index 7549186..f72a402 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -485,19 +485,25 @@ void free_initrd_mem(unsigned long start, unsigned long end)
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	pg_data_t *pgdat;
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	pgdat = NODE_DATA(nid);
>  
>  	/* We only have ZONE_NORMAL, so this is easy.. */
>  	ret = __add_pages(nid, pgdat->node_zones +
>  			zone_for_memory(nid, start, size, ZONE_NORMAL,
> -			for_device),
> +					flags & MEMORY_DEVICE),
>  			start_pfn, nr_pages);
>  	if (unlikely(ret))
>  		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
> @@ -516,13 +522,19 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = PFN_DOWN(start);
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	ret = __remove_pages(zone, start_pfn, nr_pages);
>  	if (unlikely(ret))
> diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
> index adce254..5fd972c 100644
> --- a/arch/tile/mm/init.c
> +++ b/arch/tile/mm/init.c
> @@ -863,13 +863,19 @@ void __init mem_init(void)
>   * memory to the highmem for now.
>   */
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
> -int arch_add_memory(u64 start, u64 size, bool for_device)
> +int arch_add_memory(u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata = &contig_page_data;
>  	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	return __add_pages(zone, start_pfn, nr_pages);
>  }
>  
> @@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	/* TODO */
>  	return -EBUSY;
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index cf80590..16a9095 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -816,24 +816,37 @@ void __init mem_init(void)
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdata = NODE_DATA(nid);
>  	struct zone *zone = pgdata->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
> +		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
> +				flags & MEMORY_DEVICE);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	return __add_pages(nid, zone, start_pfn, nr_pages);
>  }
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> -int arch_remove_memory(u64 start, u64 size)
> +int arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	struct zone *zone;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	zone = page_zone(pfn_to_page(start_pfn));
>  	return __remove_pages(zone, start_pfn, nr_pages);
>  }
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 14b9dd7..8c4abb0 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -651,15 +651,22 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
>   * Memory is added always to NORMAL zone. This means you will never get
>   * additional DMA/DMA32 memory.
>   */
> -int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
> +int arch_add_memory(int nid, u64 start, u64 size, int flags)
>  {
>  	struct pglist_data *pgdat = NODE_DATA(nid);
>  	struct zone *zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
> +		zone_for_memory(nid, start, size, ZONE_NORMAL,
> +				flags & MEMORY_DEVICE);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	init_memory_mapping(start, start + size);
>  
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
> @@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>  	remove_pagetable(start, end, true);
>  }
>  
> -int __ref arch_remove_memory(u64 start, u64 size)
> +int __ref arch_remove_memory(u64 start, u64 size, int flags)
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> @@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
>  	struct zone *zone;
>  	int ret;
>  
> +	/* Need to add support for device and unaddressable memory if needed */
> +	if (flags & MEMORY_UNADDRESSABLE) {
> +		BUG();
> +		return -EINVAL;
> +	}
> +
>  	/* With altmap the first mapped page is offset from @start */
>  	altmap = to_vmem_altmap((unsigned long) page);
>  	if (altmap)

So with this patch none of the architectures supports un-addressable
memory yet, and that support gets added through later patches ?
zone_for_memory() is now handed the MEMORY_DEVICE bit of the flags, so
do all the previous ZONE_DEVICE call sites which used to take
"for_device" need to be converted to the new flags ? Just curious.

> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 01033fa..ba9b12e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -103,7 +103,7 @@ extern bool memhp_auto_online;
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern bool is_pageblock_removable_nolock(struct page *page);
> -extern int arch_remove_memory(u64 start, u64 size);
> +extern int arch_remove_memory(u64 start, u64 size, int flags);
>  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
>  	unsigned long nr_pages);
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
>  extern int add_memory_resource(int nid, struct resource *resource, bool online);
>  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>  		bool for_device);
> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> +
> +/*
> + * For device memory we want more informations than just knowing it is device
> + * memory. We want to know if we can migrate it (ie it is not storage memory
> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> + * memory can not be access by CPU but we still want struct page so that we
> + * can use it like regular memory.

There are some typos here which need to be cleaned up as well. But please
have a look at the comment below about the classification itself.

> + */
> +#define MEMORY_FLAGS_NONE 0
> +#define MEMORY_DEVICE (1 << 0)
> +#define MEMORY_MOVABLE (1 << 1)
> +#define MEMORY_UNADDRESSABLE (1 << 2)

These should be DEVICE_MEMORY_* instead of MEMORY_* as we are trying to
classify device memory (even though it is represented with struct page),
not regular system RAM. The flags should only attempt to classify device
memory which is backed by struct pages; arch_add_memory()/arch_remove_memory()
do not come into play for traditional device memory which is just a PFN
range with no struct page associated with it.

Broadly these memories are either CPU accessible or inaccessible. Storage
memory like persistent memory represented through ZONE_DEVICE falls under
the accessible (coherent) category. IIUC it is not movable right now, one
reason being that page->pgmap replaces page->lru in struct page, so these
pages cannot sit on the standard LRU lists. But since struct page was added
precisely so that more core VM features could be exploited on this memory,
going forward it will have to become migratable one way or the other to
accommodate features like compaction, HW poison etc on storage memory.
Hence my point here is: let's not classify any of these memories as
non-movable. Addressable or not should be the only classification.
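Something as simple as this is what I have in mind (just a naming sketch on
top of this patch, not what it currently defines):

	/* device memory backed by struct page, only CPU addressability matters */
	#define DEVICE_MEMORY			(1 << 0)
	#define DEVICE_MEMORY_UNADDRESSABLE	(1 << 1)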

> +
> +extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
>  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>  extern bool is_memblock_offlined(struct memory_block *mem);
>  extern void remove_memory(int nid, u64 start, u64 size);
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index b501e39..07665eb 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>  	/* pages are dead and unused, undo the arch mapping */
>  	align_start = res->start & ~(SECTION_SIZE - 1);
>  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
> -	arch_remove_memory(align_start, align_size);
> +	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
>  	pgmap_radix_release(res);
>  	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
> @@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	if (error)
>  		goto err_pfn_remap;
>  
> -	error = arch_add_memory(nid, align_start, align_size, true);
> +	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
>  	if (error)
>  		goto err_add_memory;
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9629273..b2942d7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1386,7 +1386,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
>  	}
>  
>  	/* call arch's memory hotadd */
> -	ret = arch_add_memory(nid, start, size, false);
> +	ret = arch_add_memory(nid, start, size, MEMORY_FLAGS_NONE);
>  
>  	if (ret < 0)
>  		goto error;
> @@ -2205,7 +2205,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	memblock_free(start, size);
>  	memblock_remove(start, size);
>  
> -	arch_remove_memory(start, size);
> +	arch_remove_memory(start, size, MEMORY_FLAGS_NONE);

Right, these are the system RAM hotplug paths and device memory should
never get plugged in or out through them.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-21  4:53     ` Jerome Glisse
@ 2016-11-21  6:57       ` Anshuman Khandual
  2016-11-21 12:19         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21  6:57 UTC (permalink / raw)
  To: Jerome Glisse, Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On 11/21/2016 10:23 AM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 11:44:36AM +1100, Balbir Singh wrote:
>>
>>
>> On 19/11/16 05:18, Jérôme Glisse wrote:
>>> Only usefull for arch where we support ZONE_DEVICE and where we want to
>>> also support un-addressable device memory. We need struct page for such
>>> un-addressable memory. But we should avoid populating the kernel linear
>>> mapping for the physical address range because there is no real memory
>>> or anything behind those physical address.
>>>
>>> Hence we need more flags than just knowing if it is device memory or not.
>>>
>>
>>
>> Isn't it better to add a wrapper to arch_add/remove_memory and do those
>> checks inside and then call arch_add/remove_memory to reduce the churn.
>> If you need selectively enable MEMORY_UNADDRESSABLE that can be done with
>> _ARCH_HAS_FEATURE
> 
> The flag parameter can be use by other new features and thus i thought the
> churn was fine. But i do not mind either way, whatever people like best.

Right, once we get the device memory classification right, these flags
can be used in more places.

> 
> [...]
> 
>>> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
>>> +
>>> +/*
>>> + * For device memory we want more informations than just knowing it is device
>> 				     information
>>> + * memory. We want to know if we can migrate it (ie it is not storage memory
>>> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
>>> + * memory can not be access by CPU but we still want struct page so that we
>> 			accessed
>>> + * can use it like regular memory.
>>
>> Can you please add some details on why -- migration needs them for example?
> 
> I am not sure what you mean ? DAX ie persistent memory device is intended to be
> use for filesystem or persistent storage. Hence memory migration does not apply
> to it (it would go against its purpose).

Why ? It can still be used for compaction, HW errors etc where we need to
move data between persistent storage areas. Both the source and the
destination can be persistent storage memory.

> 
> So i want to extend ZONE_DEVICE to be more then just DAX/persistent memory. For
> that i need to differentatiate between device memory that can be migrated and
> should be more or less treated like regular memory (with struct page). This is
> what the MEMORY_MOVABLE flag is for.

ZONE_DEVICE right now also provides struct page for addressable memory
(whether the struct pages live inside its own range or in system RAM);
with this we are extending it to cover un-addressable memory with struct
pages. Yes, the differentiation is required.

> 
> Finaly in my case the device memory is not accessible by the CPU so i need yet
> another flag. In the end i am extending ZONE_DEVICE to be use for 3 differents
> type of memory.
> 
> Is this the kind of explanation you are looking for ?
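If I read this right, in terms of the patch 1 flags the three intended
users would roughly map to something like this (my own summary, the macro
names below are made up just to illustrate the combinations):

	/* sketch of the three ZONE_DEVICE users and their flags */
	#define DAX_PMEM_FLAGS			(MEMORY_DEVICE)
	#define COHERENT_DEVMEM_FLAGS		(MEMORY_DEVICE | MEMORY_MOVABLE)
	#define UNADDRESSABLE_DEVMEM_FLAGS	(MEMORY_DEVICE | MEMORY_MOVABLE | \
						 MEMORY_UNADDRESSABLE)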

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-11-18 18:18 ` [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
@ 2016-11-21  8:06   ` Anshuman Khandual
  2016-11-21 12:33     ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21  8:06 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> This add support for un-addressable device memory. Such memory is hotpluged
> only so we can have struct page but should never be map. This patch add code

Here the struct pages have to live inside the system RAM range, unlike the
vmem_altmap scheme where the struct pages can be placed inside the device
memory itself. That possibility does not arise for un-addressable device
memory, so maybe we will have to block the paths where a vmem_altmap is
requested along with un-addressable device memory.

> to mm page fault code path to catch any such mapping and SIGBUS on such event.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  drivers/dax/pmem.c                |  3 ++-
>  drivers/nvdimm/pmem.c             |  5 +++--
>  include/linux/memremap.h          | 23 ++++++++++++++++++++---
>  kernel/memremap.c                 | 12 +++++++++---
>  mm/memory.c                       |  9 +++++++++
>  tools/testing/nvdimm/test/iomap.c |  2 +-
>  6 files changed, 44 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
> index 1f01e98..1b42aef 100644
> --- a/drivers/dax/pmem.c
> +++ b/drivers/dax/pmem.c
> @@ -107,7 +107,8 @@ static int dax_pmem_probe(struct device *dev)
>  	if (rc)
>  		return rc;
>  
> -	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
> +	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
> +				   altmap, NULL, MEMORY_DEVICE);
>  	if (IS_ERR(addr))
>  		return PTR_ERR(addr);
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 571a6c7..5ffd937 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -260,7 +260,7 @@ static int pmem_attach_disk(struct device *dev,
>  	pmem->pfn_flags = PFN_DEV;
>  	if (is_nd_pfn(dev)) {
>  		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
> -				altmap);
> +					   altmap, NULL, MEMORY_DEVICE);
>  		pfn_sb = nd_pfn->pfn_sb;
>  		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
>  		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
> @@ -269,7 +269,8 @@ static int pmem_attach_disk(struct device *dev,
>  		res->start += pmem->data_offset;
>  	} else if (pmem_should_map_pages(dev)) {
>  		addr = devm_memremap_pages(dev, &nsio->res,
> -				&q->q_usage_counter, NULL);
> +					   &q->q_usage_counter,
> +					   NULL, NULL, MEMORY_DEVICE);
>  		pmem->pfn_flags |= PFN_MAP;
>  	} else
>  		addr = devm_memremap(dev, pmem->phys_addr,
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 9341619..fe61dca 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -41,22 +41,34 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)

^^^^^^^^^^^^^ device memory flags instead ?

>   */
>  struct dev_pagemap {
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
>  	struct device *dev;
> +	int flags;
>  };
>  
>  #ifdef CONFIG_ZONE_DEVICE
>  void *devm_memremap_pages(struct device *dev, struct resource *res,
> -		struct percpu_ref *ref, struct vmem_altmap *altmap);
> +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
> +			  struct dev_pagemap **ppgmap, int flags);
>  struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
> +
> +static inline bool is_addressable_page(const struct page *page)
> +{
> +	return ((page_zonenum(page) != ZONE_DEVICE) ||
> +		!(page->pgmap->flags & MEMORY_UNADDRESSABLE));
> +}
>  #else
>  static inline void *devm_memremap_pages(struct device *dev,
> -		struct resource *res, struct percpu_ref *ref,
> -		struct vmem_altmap *altmap)
> +					struct resource *res,
> +					struct percpu_ref *ref,
> +					struct vmem_altmap *altmap,
> +					struct dev_pagemap **ppgmap,
> +					int flags)


As I had mentioned before, devm_memremap_pages() should be changed so that
it does not accept a valid altmap along with a request for un-addressable
memory.
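Something along these lines at the top of devm_memremap_pages(), just a
sketch:

	/* un-addressable memory cannot host its own struct pages */
	if ((flags & MEMORY_UNADDRESSABLE) && altmap)
		return ERR_PTR(-EINVAL);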

>  {
>  	/*
>  	 * Fail attempts to call devm_memremap_pages() without
> @@ -71,6 +83,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>  {
>  	return NULL;
>  }
> +
> +static inline bool is_addressable_page(const struct page *page)
> +{
> +	return true;
> +}
>  #endif
>  
>  /**
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07665eb..438a73aa2 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>  	/* pages are dead and unused, undo the arch mapping */
>  	align_start = res->start & ~(SECTION_SIZE - 1);
>  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
> -	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
> +	arch_remove_memory(align_start, align_size, pgmap->flags);
>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
>  	pgmap_radix_release(res);
>  	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
> @@ -270,6 +270,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>   * @res: "host memory" address range
>   * @ref: a live per-cpu reference count
>   * @altmap: optional descriptor for allocating the memmap from @res
> + * @ppgmap: pointer set to new page dev_pagemap on success
> + * @flags: flag for memory (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
>   *
>   * Notes:
>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
> @@ -280,7 +282,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>   *    this is not enforced.
>   */
>  void *devm_memremap_pages(struct device *dev, struct resource *res,
> -		struct percpu_ref *ref, struct vmem_altmap *altmap)
> +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
> +			  struct dev_pagemap **ppgmap, int flags)
>  {
>  	resource_size_t key, align_start, align_size, align_end;
>  	pgprot_t pgprot = PAGE_KERNEL;
> @@ -322,6 +325,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	}
>  	pgmap->ref = ref;
>  	pgmap->res = &page_map->res;
> +	pgmap->flags = flags | MEMORY_DEVICE;

So the caller of devm_memremap_pages() should not pass MEMORY_DEVICE in the
flags it hands to this function ? Hmm, otherwise we should just check that
the flags contain only the appropriate bits before proceeding.
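If we go with the first option (callers pass only the modifier bits and
MEMORY_DEVICE is implied), the check I have in mind would look roughly
like this (sketch only):

	/* sketch: reject anything but the known modifier bits */
	if (WARN_ON(flags & ~(MEMORY_MOVABLE | MEMORY_UNADDRESSABLE)))
		return ERR_PTR(-EINVAL);

with pgmap->flags = flags | MEMORY_DEVICE staying as it is.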

>  
>  	mutex_lock(&pgmap_lock);
>  	error = 0;
> @@ -358,7 +362,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	if (error)
>  		goto err_pfn_remap;
>  
> -	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
> +	error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
>  	if (error)
>  		goto err_add_memory;
>  
> @@ -375,6 +379,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  		page->pgmap = pgmap;
>  	}
>  	devres_add(dev, page_map);
> +	if (ppgmap)
> +		*ppgmap = pgmap;
>  	return __va(res->start);
>  
>   err_add_memory:
> diff --git a/mm/memory.c b/mm/memory.c
> index 840adc6..15f2908 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -45,6 +45,7 @@
>  #include <linux/swap.h>
>  #include <linux/highmem.h>
>  #include <linux/pagemap.h>
> +#include <linux/memremap.h>
>  #include <linux/ksm.h>
>  #include <linux/rmap.h>
>  #include <linux/export.h>
> @@ -3482,6 +3483,7 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
>  static int handle_pte_fault(struct fault_env *fe)
>  {
>  	pte_t entry;
> +	struct page *page;
>  
>  	if (unlikely(pmd_none(*fe->pmd))) {
>  		/*
> @@ -3533,6 +3535,13 @@ static int handle_pte_fault(struct fault_env *fe)
>  	if (pte_protnone(entry) && vma_is_accessible(fe->vma))
>  		return do_numa_page(fe, entry);
>  
> +	/* Catch mapping of un-addressable memory this should never happen */
> +	page = pfn_to_page(pte_pfn(entry));
> +	if (!is_addressable_page(page)) {
> +		print_bad_pte(fe->vma, fe->address, entry, page);
> +		return VM_FAULT_SIGBUS;
> +	}

Right, core VM should never put an un-addressable page in the page table.

> +
>  	fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
>  	spin_lock(fe->ptl);
>  	if (unlikely(!pte_same(*fe->pte, entry)))
> diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
> index c29f8dc..899d6a8 100644
> --- a/tools/testing/nvdimm/test/iomap.c
> +++ b/tools/testing/nvdimm/test/iomap.c
> @@ -108,7 +108,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
>  
>  	if (nfit_res)
>  		return nfit_res->buf + offset - nfit_res->res->start;
> -	return devm_memremap_pages(dev, res, ref, altmap);
> +	return devm_memremap_pages(dev, res, ref, altmap, MEMORY_DEVICE);
>  }
>  EXPORT_SYMBOL(__wrap_devm_memremap_pages);
>  
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
  2016-11-18 18:18 ` [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
@ 2016-11-21  8:18   ` Anshuman Khandual
  2016-11-21 12:50     ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21  8:18 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> Catch page from ZONE_DEVICE in free_hot_cold_page(). This should never
> happen as ZONE_DEVICE page must always have an elevated refcount.
> 
> This is to catch refcounting issues in a sane way for ZONE_DEVICE pages.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead..09b2630 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2435,6 +2435,16 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>  
> +	/*
> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
> +	 * active refcount. Complain about it and try to restore the refcount.
> +	 */
> +	if (is_zone_device_page(page)) {
> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
> +		page_ref_inc(page);
> +		return;
> +	}

This fixes an issue in the existing ZONE_DEVICE code; should this patch not
be sent separately rather than as part of this series ?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
  2016-11-21  1:49   ` Balbir Singh
@ 2016-11-21  8:26   ` Anshuman Khandual
  2016-11-21 12:34     ` Jerome Glisse
  1 sibling, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21  8:26 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
> is holding a reference on it (only device to which the memory belong do).
> Add a callback and call it when that happen so device driver can implement
> their own free page management.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/memremap.h | 4 ++++
>  kernel/memremap.c        | 8 ++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index fe61dca..469c88d 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -37,17 +37,21 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>  
>  /**
>   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> + * @free_devpage: free page callback when page refcount reach 1
>   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>   * @res: physical address range covered by @ref
>   * @ref: reference count that pins the devm_memremap_pages() mapping
>   * @dev: host device of the mapping for debug
> + * @data: privata data pointer for free_devpage
>   * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
>   */
>  struct dev_pagemap {
> +	void (*free_devpage)(struct page *page, void *data);
>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
>  	struct device *dev;
> +	void *data;
>  	int flags;
>  };
>  
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 438a73aa2..3d28048 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
>  
>  void put_zone_device_page(struct page *page)
>  {
> +	/*
> +	 * If refcount is 1 then page is freed and refcount is stable as nobody
> +	 * holds a reference on the page.
> +	 */
> +	if (page->pgmap->free_devpage && page_count(page) == 1)
> +		page->pgmap->free_devpage(page, page->pgmap->data);
>  	put_dev_pagemap(page->pgmap);
>  }
>  EXPORT_SYMBOL(put_zone_device_page);
> @@ -326,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  	pgmap->ref = ref;
>  	pgmap->res = &page_map->res;
>  	pgmap->flags = flags | MEMORY_DEVICE;
> +	pgmap->free_devpage = NULL;
> +	pgmap->data = NULL;

When is the driver expected to load up pgmap->free_devpage ? I thought this
function would be one of the right places. Though, since all the pages from
the same hotplug operation point to the same dev_pagemap structure, this
loading can also be done at a later point in time.
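For example a driver could do something like this once it gets the pgmap
pointer back through the new ppgmap argument (purely hypothetical driver
code; my_dev and my_free_devpage are made-up names, dev/res/ref are whatever
the driver already has):

	struct dev_pagemap *pgmap;
	void *addr;

	addr = devm_memremap_pages(dev, res, ref, NULL, &pgmap,
				   MEMORY_DEVICE | MEMORY_UNADDRESSABLE);
	if (!IS_ERR(addr)) {
		pgmap->data = my_dev;
		pgmap->free_devpage = my_free_devpage;
	}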

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2016-11-18 18:18 ` [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
@ 2016-11-21 10:37   ` Anshuman Khandual
  2016-11-21 12:39     ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21 10:37 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> HMM wants to remove device memory early before device tear down so add an
> helper to do that.

Could you please explain why HMM wants to remove device memory before
device tear down ?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
  2016-11-21  2:06   ` Balbir Singh
@ 2016-11-21 10:58   ` Anshuman Khandual
  2016-11-21 12:42     ` Jerome Glisse
  1 sibling, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21 10:58 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> To allow use of device un-addressable memory inside a process add a
> special swap type. Also add a new callback to handle page fault on
> such entry.

IIUC this swap type is required only for the mirroring case and is not a
requirement for migration. If it is required for the mirroring purpose,
where we intercept each page fault, the commit message here should
elaborate on that more clearly.

> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/proc/task_mmu.c       | 10 +++++++-
>  include/linux/memremap.h |  5 ++++
>  include/linux/swap.h     | 18 ++++++++++---
>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/memremap.c        | 14 ++++++++++
>  mm/Kconfig               | 12 +++++++++
>  mm/memory.c              | 24 +++++++++++++++++
>  mm/mprotect.c            | 12 +++++++++
>  8 files changed, 158 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6909582..0726d39 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>  			} else {
>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
>  			}
> -		} else if (is_migration_entry(swpent))
> +		} else if (is_migration_entry(swpent)) {
>  			page = migration_entry_to_page(swpent);
> +		} else if (is_device_entry(swpent)) {
> +			page = device_entry_to_page(swpent);
> +		}
>  	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>  							&& pte_none(*pte))) {
>  		page = find_get_entry(vma->vm_file->f_mapping,
> @@ -708,6 +711,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  
>  		if (is_migration_entry(swpent))
>  			page = migration_entry_to_page(swpent);
> +		if (is_device_entry(swpent))
> +			page = device_entry_to_page(swpent);
>  	}
>  	if (page) {
>  		int mapcount = page_mapcount(page);
> @@ -1191,6 +1196,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_SWAP;
>  		if (is_migration_entry(entry))
>  			page = migration_entry_to_page(entry);
> +
> +		if (is_device_entry(entry))
> +			page = device_entry_to_page(entry);
>  	}
>  
>  	if (page && !PageAnon(page))
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index b6f03e9..d584c74 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>   */
>  struct dev_pagemap {
>  	void (*free_devpage)(struct page *page, void *data);
> +	int (*fault)(struct vm_area_struct *vma,
> +		     unsigned long addr,
> +		     struct page *page,
> +		     unsigned flags,
> +		     pmd_t *pmdp);

We are extending dev_pagemap once again to accommodate device driver
specific fault routines for these pages. I wonder whether this extension
and the new swap type should be in the same patch.

>  	struct vmem_altmap *altmap;
>  	const struct resource *res;
>  	struct percpu_ref *ref;
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7e553e1..599cb54 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
>   */
>  
>  /*
> + * Un-addressable device memory support
> + */
> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> +#define SWP_DEVICE_NUM 2
> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
> +#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
> +#else
> +#define SWP_DEVICE_NUM 0
> +#endif
> +
> +/*
>   * NUMA node memory migration support
>   */
>  #ifdef CONFIG_MIGRATION
> @@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
>  #endif
>  
>  #define MAX_SWAPFILES \
> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>  
>  /*
>   * Magic header for a swap area. The first part of the union is
> @@ -442,8 +454,8 @@ static inline void show_swap_cache_info(void)
>  {
>  }
>  
> -#define free_swap_and_cache(swp)	is_migration_entry(swp)
> -#define swapcache_prepare(swp)		is_migration_entry(swp)
> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
>  
>  static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>  {
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 5c3a5f3..d1aa425 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>  	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
>  }
>  
> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> +{
> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> +}
> +
> +static inline bool is_device_entry(swp_entry_t entry)
> +{
> +	int type = swp_type(entry);
> +	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
> +}
> +
> +static inline void make_device_entry_read(swp_entry_t *entry)
> +{
> +	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
> +}
> +
> +static inline bool is_write_device_entry(swp_entry_t entry)
> +{
> +	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
> +}
> +
> +static inline struct page *device_entry_to_page(swp_entry_t entry)
> +{
> +	return pfn_to_page(swp_offset(entry));
> +}
> +
> +int device_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned flags,
> +		       pmd_t *pmdp);
> +#else /* CONFIG_DEVICE_UNADDRESSABLE */
> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> +{
> +	return swp_entry(0, 0);
> +}
> +
> +static inline void make_device_entry_read(swp_entry_t *entry)
> +{
> +}
> +
> +static inline bool is_device_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline bool is_write_device_entry(swp_entry_t entry)
> +{
> +	return false;
> +}
> +
> +static inline struct page *device_entry_to_page(swp_entry_t entry)
> +{
> +	return NULL;
> +}
> +
> +static inline int device_entry_fault(struct vm_area_struct *vma,
> +				     unsigned long addr,
> +				     swp_entry_t entry,
> +				     unsigned flags,
> +				     pmd_t *pmdp)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_DEVICE_UNADDRESSABLE */
> +
>  #ifdef CONFIG_MIGRATION
>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>  {
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index cf83928..0670015 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -18,6 +18,8 @@
>  #include <linux/io.h>
>  #include <linux/mm.h>
>  #include <linux/memory_hotplug.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
>  
>  #ifndef ioremap_cache
>  /* temporary while we convert existing ioremap_cache users to memremap */
> @@ -200,6 +202,18 @@ void put_zone_device_page(struct page *page)
>  }
>  EXPORT_SYMBOL(put_zone_device_page);
>  
> +int device_entry_fault(struct vm_area_struct *vma,
> +		       unsigned long addr,
> +		       swp_entry_t entry,
> +		       unsigned flags,
> +		       pmd_t *pmdp)
> +{
> +	struct page *page = device_entry_to_page(entry);
> +

Perhaps add a BUG_ON() here in case page->pgmap->fault has not been
populated by the driver.
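ie something like this (sketch):

	struct page *page = device_entry_to_page(entry);

	BUG_ON(!page->pgmap->fault);
	return page->pgmap->fault(vma, addr, page, flags, pmdp);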

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-21  2:06   ` Balbir Singh
  2016-11-21  5:05     ` Jerome Glisse
@ 2016-11-21 11:10     ` Anshuman Khandual
  1 sibling, 0 replies; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-21 11:10 UTC (permalink / raw)
  To: Balbir Singh, Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 07:36 AM, Balbir Singh wrote:
> 
> 
> On 19/11/16 05:18, Jérôme Glisse wrote:
>> To allow use of device un-addressable memory inside a process add a
>> special swap type. Also add a new callback to handle page fault on
>> such entry.
>>
>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> ---
>>  fs/proc/task_mmu.c       | 10 +++++++-
>>  include/linux/memremap.h |  5 ++++
>>  include/linux/swap.h     | 18 ++++++++++---
>>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/memremap.c        | 14 ++++++++++
>>  mm/Kconfig               | 12 +++++++++
>>  mm/memory.c              | 24 +++++++++++++++++
>>  mm/mprotect.c            | 12 +++++++++
>>  8 files changed, 158 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 6909582..0726d39 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>  			} else {
>>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
>>  			}
>> -		} else if (is_migration_entry(swpent))
>> +		} else if (is_migration_entry(swpent)) {
>>  			page = migration_entry_to_page(swpent);
>> +		} else if (is_device_entry(swpent)) {
>> +			page = device_entry_to_page(swpent);
>> +		}
> 
> 
> So the reason there is a device swap entry for a page belonging to a user process is
> that it is in the middle of migration or is it always that a swap entry represents
> unaddressable memory belonging to a GPU device, but its tracked in the page table
> entries of the process.

I guess the latter is the case and it is used for the page table mirroring
purpose after intercepting the page faults. But I will leave it up to Jerome
to explain more on this.

> 
>>  	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>>  							&& pte_none(*pte))) {
>>  		page = find_get_entry(vma->vm_file->f_mapping,
>> @@ -708,6 +711,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>>  
>>  		if (is_migration_entry(swpent))
>>  			page = migration_entry_to_page(swpent);
>> +		if (is_device_entry(swpent))
>> +			page = device_entry_to_page(swpent);
>>  	}
>>  	if (page) {
>>  		int mapcount = page_mapcount(page);
>> @@ -1191,6 +1196,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>  		flags |= PM_SWAP;
>>  		if (is_migration_entry(entry))
>>  			page = migration_entry_to_page(entry);
>> +
>> +		if (is_device_entry(entry))
>> +			page = device_entry_to_page(entry);
>>  	}
>>  
>>  	if (page && !PageAnon(page))
>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> index b6f03e9..d584c74 100644
>> --- a/include/linux/memremap.h
>> +++ b/include/linux/memremap.h
>> @@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>>   */
>>  struct dev_pagemap {
>>  	void (*free_devpage)(struct page *page, void *data);
>> +	int (*fault)(struct vm_area_struct *vma,
>> +		     unsigned long addr,
>> +		     struct page *page,
>> +		     unsigned flags,
>> +		     pmd_t *pmdp);
>>  	struct vmem_altmap *altmap;
>>  	const struct resource *res;
>>  	struct percpu_ref *ref;
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 7e553e1..599cb54 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
>>   */
>>  
>>  /*
>> + * Un-addressable device memory support
>> + */
>> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
>> +#define SWP_DEVICE_NUM 2
>> +#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
>> +#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
>> +#else
>> +#define SWP_DEVICE_NUM 0
>> +#endif
>> +
>> +/*
>>   * NUMA node memory migration support
>>   */
>>  #ifdef CONFIG_MIGRATION
>> @@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
>>  #endif
>>  
>>  #define MAX_SWAPFILES \
>> -	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>> +	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
>> +	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>>  
>>  /*
>>   * Magic header for a swap area. The first part of the union is
>> @@ -442,8 +454,8 @@ static inline void show_swap_cache_info(void)
>>  {
>>  }
>>  
>> -#define free_swap_and_cache(swp)	is_migration_entry(swp)
>> -#define swapcache_prepare(swp)		is_migration_entry(swp)
>> +#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
>> +#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
>>  
>>  static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
>>  {
>> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
>> index 5c3a5f3..d1aa425 100644
>> --- a/include/linux/swapops.h
>> +++ b/include/linux/swapops.h
>> @@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
>>  	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
>>  }
>>  
>> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
>> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
>> +{
>> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> 
> Code style checks
> 
>> +}
>> +
>> +static inline bool is_device_entry(swp_entry_t entry)
>> +{
>> +	int type = swp_type(entry);
>> +	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
>> +}
>> +
>> +static inline void make_device_entry_read(swp_entry_t *entry)
>> +{
>> +	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
>> +}
>> +
>> +static inline bool is_write_device_entry(swp_entry_t entry)
>> +{
>> +	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
>> +}
>> +
>> +static inline struct page *device_entry_to_page(swp_entry_t entry)
>> +{
>> +	return pfn_to_page(swp_offset(entry));
>> +}
>> +
>> +int device_entry_fault(struct vm_area_struct *vma,
>> +		       unsigned long addr,
>> +		       swp_entry_t entry,
>> +		       unsigned flags,
>> +		       pmd_t *pmdp);
>> +#else /* CONFIG_DEVICE_UNADDRESSABLE */
>> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
>> +{
>> +	return swp_entry(0, 0);
>> +}
>> +
>> +static inline void make_device_entry_read(swp_entry_t *entry)
>> +{
>> +}
>> +
>> +static inline bool is_device_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline bool is_write_device_entry(swp_entry_t entry)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline struct page *device_entry_to_page(swp_entry_t entry)
>> +{
>> +	return NULL;
>> +}
>> +
>> +static inline int device_entry_fault(struct vm_area_struct *vma,
>> +				     unsigned long addr,
>> +				     swp_entry_t entry,
>> +				     unsigned flags,
>> +				     pmd_t *pmdp)
>> +{
>> +	return VM_FAULT_SIGBUS;
>> +}
>> +#endif /* CONFIG_DEVICE_UNADDRESSABLE */
>> +
>>  #ifdef CONFIG_MIGRATION
>>  static inline swp_entry_t make_migration_entry(struct page *page, int write)
>>  {
>> diff --git a/kernel/memremap.c b/kernel/memremap.c
>> index cf83928..0670015 100644
>> --- a/kernel/memremap.c
>> +++ b/kernel/memremap.c
>> @@ -18,6 +18,8 @@
>>  #include <linux/io.h>
>>  #include <linux/mm.h>
>>  #include <linux/memory_hotplug.h>
>> +#include <linux/swap.h>
>> +#include <linux/swapops.h>
>>  
>>  #ifndef ioremap_cache
>>  /* temporary while we convert existing ioremap_cache users to memremap */
>> @@ -200,6 +202,18 @@ void put_zone_device_page(struct page *page)
>>  }
>>  EXPORT_SYMBOL(put_zone_device_page);
>>  
>> +int device_entry_fault(struct vm_area_struct *vma,
>> +		       unsigned long addr,
>> +		       swp_entry_t entry,
>> +		       unsigned flags,
>> +		       pmd_t *pmdp)
>> +{
>> +	struct page *page = device_entry_to_page(entry);
>> +
>> +	return page->pgmap->fault(vma, addr, page, flags, pmdp);
>> +}
>> +EXPORT_SYMBOL(device_entry_fault);
>> +
>>  static void pgmap_radix_release(struct resource *res)
>>  {
>>  	resource_size_t key, align_start, align_size, align_end;
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index be0ee11..0a21411 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -704,6 +704,18 @@ config ZONE_DEVICE
>>  
>>  	  If FS_DAX is enabled, then say Y.
>>  
>> +config DEVICE_UNADDRESSABLE
>> +	bool "Un-addressable device memory (GPU memory, ...)"
>> +	depends on ZONE_DEVICE
>> +
>> +	help
>> +	  Allow to create struct page for un-addressable device memory
>> +	  ie memory that is only accessible by the device (or group of
>> +	  devices).
>> +
>> +	  This allow to migrate chunk of process memory to device memory
>> +	  while that memory is use by the device.
>> +
>>  config FRAME_VECTOR
>>  	bool
>>  
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 15f2908..a83d690 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -889,6 +889,21 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>>  					pte = pte_swp_mksoft_dirty(pte);
>>  				set_pte_at(src_mm, addr, src_pte, pte);
>>  			}
>> +		} else if (is_device_entry(entry)) {
>> +			page = device_entry_to_page(entry);
>> +
>> +			get_page(page);
>> +			rss[mm_counter(page)]++;
> 
> Why does rss count go up?
> 
>> +			page_dup_rmap(page, false);
>> +
>> +			if (is_write_device_entry(entry) &&
>> +			    is_cow_mapping(vm_flags)) {
>> +				make_device_entry_read(&entry);
>> +				pte = swp_entry_to_pte(entry);
>> +				if (pte_swp_soft_dirty(*src_pte))
>> +					pte = pte_swp_mksoft_dirty(pte);
>> +				set_pte_at(src_mm, addr, src_pte, pte);
>> +			}
>>  		}
>>  		goto out_set_pte;
>>  	}
>> @@ -1191,6 +1206,12 @@ again:
>>  
>>  			page = migration_entry_to_page(entry);
>>  			rss[mm_counter(page)]--;
>> +		} else if (is_device_entry(entry)) {
>> +			struct page *page = device_entry_to_page(entry);
>> +			rss[mm_counter(page)]--;
>> +
>> +			page_remove_rmap(page, false);
>> +			put_page(page);
>>  		}
>>  		if (unlikely(!free_swap_and_cache(entry)))
>>  			print_bad_pte(vma, addr, ptent, NULL);
>> @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
>>  	if (unlikely(non_swap_entry(entry))) {
>>  		if (is_migration_entry(entry)) {
>>  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
>> +		} else if (is_device_entry(entry)) {
>> +			ret = device_entry_fault(vma, fe->address, entry,
>> +						 fe->flags, fe->pmd);
> 
> What does device_entry_fault() actually do here?

IIUC it calls page->pgmap->fault(), which is the device specific page fault
handler for the page, and that is how control reaches the device driver
from the core VM.
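So a driver would register something like the callback below and the core
VM just bounces the fault to it (hypothetical driver code, the names are
made up):

	static int my_devmem_fault(struct vm_area_struct *vma, unsigned long addr,
				   struct page *page, unsigned flags, pmd_t *pmdp)
	{
		/*
		 * The driver would typically migrate the page back to regular
		 * system memory here and fix up the CPU page table before
		 * returning, so that the faulting CPU access can proceed.
		 */
		return my_migrate_back_to_system_ram(vma, addr, page, pmdp);
	}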

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-21  6:57       ` Anshuman Khandual
@ 2016-11-21 12:19         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:19 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Balbir Singh, akpm, linux-kernel, linux-mm, John Hubbard,
	Russell King, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Martin Schwidefsky, Heiko Carstens,
	Yoshinori Sato, Rich Felker, Chris Metcalf, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

On Mon, Nov 21, 2016 at 12:27:15PM +0530, Anshuman Khandual wrote:
> On 11/21/2016 10:23 AM, Jerome Glisse wrote:
> > On Mon, Nov 21, 2016 at 11:44:36AM +1100, Balbir Singh wrote:
> >>
> >>
> >> On 19/11/16 05:18, Jérôme Glisse wrote:
> >>> Only usefull for arch where we support ZONE_DEVICE and where we want to
> >>> also support un-addressable device memory. We need struct page for such
> >>> un-addressable memory. But we should avoid populating the kernel linear
> >>> mapping for the physical address range because there is no real memory
> >>> or anything behind those physical address.
> >>>
> >>> Hence we need more flags than just knowing if it is device memory or not.
> >>>
> >>
> >>
> >> Isn't it better to add a wrapper to arch_add/remove_memory and do those
> >> checks inside and then call arch_add/remove_memory to reduce the churn.
> >> If you need selectively enable MEMORY_UNADDRESSABLE that can be done with
> >> _ARCH_HAS_FEATURE
> > 
> > The flag parameter can be use by other new features and thus i thought the
> > churn was fine. But i do not mind either way, whatever people like best.
> 
> Right, once we get the device memory classification right, these flags
> can be used in more places.
> 
> > 
> > [...]
> > 
> >>> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> >>> +
> >>> +/*
> >>> + * For device memory we want more informations than just knowing it is device
> >> 				     information
> >>> + * memory. We want to know if we can migrate it (ie it is not storage memory
> >>> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> >>> + * memory can not be access by CPU but we still want struct page so that we
> >> 			accessed
> >>> + * can use it like regular memory.
> >>
> >> Can you please add some details on why -- migration needs them for example?
> > 
> > I am not sure what you mean ? DAX ie persistent memory device is intended to be
> > use for filesystem or persistent storage. Hence memory migration does not apply
> > to it (it would go against its purpose).
> 
> Why ? It can still be used for compaction, HW errors etc where we need to
> move between persistent storage areas. The source and destination can be
> persistent storage memory.

Well I don't think they intend to do migration for that; HW errors are
hidden by the hardware itself so far. But yes, if they care then they could
use page migration. So far it does not seem to be something any of the folks
working on persistent memory consider useful.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-21  6:41   ` Anshuman Khandual
@ 2016-11-21 12:27     ` Jerome Glisse
  2016-11-22  5:35       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:27 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Mon, Nov 21, 2016 at 12:11:50PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:

[...]

> > @@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> >  	remove_pagetable(start, end, true);
> >  }
> >  
> > -int __ref arch_remove_memory(u64 start, u64 size)
> > +int __ref arch_remove_memory(u64 start, u64 size, int flags)
> >  {
> >  	unsigned long start_pfn = start >> PAGE_SHIFT;
> >  	unsigned long nr_pages = size >> PAGE_SHIFT;
> > @@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
> >  	struct zone *zone;
> >  	int ret;
> >  
> > +	/* Need to add support for device and unaddressable memory if needed */
> > +	if (flags & MEMORY_UNADDRESSABLE) {
> > +		BUG();
> > +		return -EINVAL;
> > +	}
> > +
> >  	/* With altmap the first mapped page is offset from @start */
> >  	altmap = to_vmem_altmap((unsigned long) page);
> >  	if (altmap)
> 
> So with this patch none of the architectures support un-addressable
> memory but then support will be added through later patches ?
> zone_for_memory function's flag now takes MEMORY_DEVICE parameter.
> Then we need to change all the previous ZONE_DEVICE changes which
> ever took "for_device" to accommodate this new flag ? just curious.

Yes, correct.


> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 01033fa..ba9b12e 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -103,7 +103,7 @@ extern bool memhp_auto_online;
> >  
> >  #ifdef CONFIG_MEMORY_HOTREMOVE
> >  extern bool is_pageblock_removable_nolock(struct page *page);
> > -extern int arch_remove_memory(u64 start, u64 size);
> > +extern int arch_remove_memory(u64 start, u64 size, int flags);
> >  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
> >  	unsigned long nr_pages);
> >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> > @@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
> >  extern int add_memory_resource(int nid, struct resource *resource, bool online);
> >  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
> >  		bool for_device);
> > -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> > +
> > +/*
> > + * For device memory we want more informations than just knowing it is device
> > + * memory. We want to know if we can migrate it (ie it is not storage memory
> > + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> > + * memory can not be access by CPU but we still want struct page so that we
> > + * can use it like regular memory.
> 
> Some typos here. Needs to be cleaned up as well. But please have a
> look at comment below over the classification itself.
> 
> > + */
> > +#define MEMORY_FLAGS_NONE 0
> > +#define MEMORY_DEVICE (1 << 0)
> > +#define MEMORY_MOVABLE (1 << 1)
> > +#define MEMORY_UNADDRESSABLE (1 << 2)
> 
> It should be DEVICE_MEMORY_* instead of MEMORY_* as we are trying to
> classify device memory (though they are represented with struct page)
> not regular system ram memory. This should attempt to classify device
> memory which is backed by struct pages. arch_add_memory/arch_remove
> _memory does not come into play if it's traditional device memory
> which is just PFN and does not have struct page associated with it.

Good idea, I will change that.


> Broadly they are either CPU accessible or in-accessible. Storage
> memory like persistent memory represented though ZONE_DEVICE fall
> under the accessible (coherent) category. IIUC right now they are
> not movable because page->pgmap replaces page->lru in struct page
> hence its inability to be on standard LRU lists as one of the
> reasons. As there was a need to have struct page to exploit more
> core VM features on these memory going forward it will have to be
> migratable one way or the other to accommodate features like
> compaction, HW poison etc in these storage memory. Hence my point
> here is lets not classify any of these memories as non-movable.
> Just addressable or not should be the only classification.

Being on the LRU or not is not an issue with respect to migration. Being
on the LRU was used as an indication that the page is managed through the
standard mm code and thus that many assumptions hold, which in turn allow
migration. But if one uses device memory following all the rules of
regular memory, then migration can be done no matter whether the page is
on the LRU or not.

I still think that MOVABLE is an important distinction, as I am pretty
sure that the persistent memory folks do not want to see their pages
migrated in any way. I might rename it to DEVICE_MEMORY_ALLOW_MIGRATION.
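
For what it is worth, a rough sketch of what the renamed flags could look
like (names below are only illustrative, nothing final):

	#define DEVICE_MEMORY_FLAGS_NONE	0
	#define DEVICE_MEMORY			(1 << 0)
	#define DEVICE_MEMORY_ALLOW_MIGRATION	(1 << 1)
	#define DEVICE_MEMORY_UNADDRESSABLE	(1 << 2)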

Cheers,
Jérôme 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-11-21  8:06   ` Anshuman Khandual
@ 2016-11-21 12:33     ` Jerome Glisse
  2016-11-22  5:15       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:33 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 01:36:57PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > This add support for un-addressable device memory. Such memory is hotpluged
> > only so we can have struct page but should never be map. This patch add code
> 
> struct pages inside the system RAM range unlike the vmem_altmap scheme
> where the struct pages can be inside the device memory itself. This
> possibility does not arise for un addressable device memory. May be we
> will have to block the paths where vmem_altmap is requested along with
> un addressable device memory.

I did not think checking for that explicitly was necessary; it sounded
like shooting yourself in the foot and I assumed it would be obvious :)
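
That said, an explicit check is cheap. If we add one, a minimal sketch of
what I have in mind, early in devm_memremap_pages() (placement and error
value are illustrative only):

	/*
	 * Illustrative only: reject the nonsensical combination of an
	 * altmap (memmap allocated from the device range) together with
	 * un-addressable device memory.
	 */
	if ((flags & MEMORY_UNADDRESSABLE) && altmap)
		return ERR_PTR(-EINVAL);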

[...]

> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 9341619..fe61dca 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -41,22 +41,34 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >   * @res: physical address range covered by @ref
> >   * @ref: reference count that pins the devm_memremap_pages() mapping
> >   * @dev: host device of the mapping for debug
> > + * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
> 
> ^^^^^^^^^^^^^ device memory flags instead ?

Well, maybe it will be used for something other than device memory in the
future, but yes, for now it is only device memory so I can rename it.

> >   */
> >  struct dev_pagemap {
> >  	struct vmem_altmap *altmap;
> >  	const struct resource *res;
> >  	struct percpu_ref *ref;
> >  	struct device *dev;
> > +	int flags;
> >  };
> >  
> >  #ifdef CONFIG_ZONE_DEVICE
> >  void *devm_memremap_pages(struct device *dev, struct resource *res,
> > -		struct percpu_ref *ref, struct vmem_altmap *altmap);
> > +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
> > +			  struct dev_pagemap **ppgmap, int flags);
> >  struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
> > +
> > +static inline bool is_addressable_page(const struct page *page)
> > +{
> > +	return ((page_zonenum(page) != ZONE_DEVICE) ||
> > +		!(page->pgmap->flags & MEMORY_UNADDRESSABLE));
> > +}
> >  #else
> >  static inline void *devm_memremap_pages(struct device *dev,
> > -		struct resource *res, struct percpu_ref *ref,
> > -		struct vmem_altmap *altmap)
> > +					struct resource *res,
> > +					struct percpu_ref *ref,
> > +					struct vmem_altmap *altmap,
> > +					struct dev_pagemap **ppgmap,
> > +					int flags)
> 
> 
> As I had mentioned before devm_memremap_pages() should be changed not
> to accept a valid altmap along with request for un-addressable memory.

If you fear such a case, yes, sure.


[...]

> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 07665eb..438a73aa2 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
> >  	/* pages are dead and unused, undo the arch mapping */
> >  	align_start = res->start & ~(SECTION_SIZE - 1);
> >  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
> > -	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
> > +	arch_remove_memory(align_start, align_size, pgmap->flags);
> >  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
> >  	pgmap_radix_release(res);
> >  	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
> > @@ -270,6 +270,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
> >   * @res: "host memory" address range
> >   * @ref: a live per-cpu reference count
> >   * @altmap: optional descriptor for allocating the memmap from @res
> > + * @ppgmap: pointer set to new page dev_pagemap on success
> > + * @flags: flag for memory (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
> >   *
> >   * Notes:
> >   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
> > @@ -280,7 +282,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
> >   *    this is not enforced.
> >   */
> >  void *devm_memremap_pages(struct device *dev, struct resource *res,
> > -		struct percpu_ref *ref, struct vmem_altmap *altmap)
> > +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
> > +			  struct dev_pagemap **ppgmap, int flags)
> >  {
> >  	resource_size_t key, align_start, align_size, align_end;
> >  	pgprot_t pgprot = PAGE_KERNEL;
> > @@ -322,6 +325,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >  	}
> >  	pgmap->ref = ref;
> >  	pgmap->res = &page_map->res;
> > +	pgmap->flags = flags | MEMORY_DEVICE;
> 
> So the caller of devm_memremap_pages() should not have give out MEMORY_DEVICE
> in the flag it passed on to this function ? Hmm, else we should just check
> that the flags contains all appropriate bits before proceeding.

Here I was just trying to be on the safe side. Yes, the caller should
already have set the flag, but this function is only used for device
memory so it did not seem like it would hurt to be extra safe. I can add
a BUG_ON(), but it seems people have mixed feelings about BUG_ON().

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-21  8:26   ` Anshuman Khandual
@ 2016-11-21 12:34     ` Jerome Glisse
  2016-11-22  5:02       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:34 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 01:56:02PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
> > is holding a reference on it (only device to which the memory belong do).
> > Add a callback and call it when that happen so device driver can implement
> > their own free page management.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  include/linux/memremap.h | 4 ++++
> >  kernel/memremap.c        | 8 ++++++++
> >  2 files changed, 12 insertions(+)
> > 
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index fe61dca..469c88d 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -37,17 +37,21 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >  
> >  /**
> >   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> > + * @free_devpage: free page callback when page refcount reach 1
> >   * @altmap: pre-allocated/reserved memory for vmemmap allocations
> >   * @res: physical address range covered by @ref
> >   * @ref: reference count that pins the devm_memremap_pages() mapping
> >   * @dev: host device of the mapping for debug
> > + * @data: privata data pointer for free_devpage
> >   * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
> >   */
> >  struct dev_pagemap {
> > +	void (*free_devpage)(struct page *page, void *data);
> >  	struct vmem_altmap *altmap;
> >  	const struct resource *res;
> >  	struct percpu_ref *ref;
> >  	struct device *dev;
> > +	void *data;
> >  	int flags;
> >  };
> >  
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 438a73aa2..3d28048 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
> >  
> >  void put_zone_device_page(struct page *page)
> >  {
> > +	/*
> > +	 * If refcount is 1 then page is freed and refcount is stable as nobody
> > +	 * holds a reference on the page.
> > +	 */
> > +	if (page->pgmap->free_devpage && page_count(page) == 1)
> > +		page->pgmap->free_devpage(page, page->pgmap->data);
> >  	put_dev_pagemap(page->pgmap);
> >  }
> >  EXPORT_SYMBOL(put_zone_device_page);
> > @@ -326,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> >  	pgmap->ref = ref;
> >  	pgmap->res = &page_map->res;
> >  	pgmap->flags = flags | MEMORY_DEVICE;
> > +	pgmap->free_devpage = NULL;
> > +	pgmap->data = NULL;
> 
> When is the driver expected to load up pgmap->free_devpage ? I thought
> this function is one of the right places. Though as all the pages in
> the same hotplug operation point to the same dev_pagemap structure this
> loading can be done at later point of time as well.
> 

I wanted to avoid adding more arguments to devm_memremap_pages() as it
already has a long list. Hence I let the caller set those afterward.
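
The expected driver-side pattern is roughly the following (the callback
and private data names are made up; the prototype is the one from this
series):

	struct dev_pagemap *pgmap;
	void *addr;

	addr = devm_memremap_pages(dev, res, ref, NULL, &pgmap, MEMORY_DEVICE);
	if (IS_ERR(addr))
		return PTR_ERR(addr);
	/* Hypothetical driver callback and private data. */
	pgmap->free_devpage = mydrv_free_devpage;
	pgmap->data = mydrv;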

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2016-11-21 10:37   ` Anshuman Khandual
@ 2016-11-21 12:39     ` Jerome Glisse
  2016-11-22  4:54       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:39 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 04:07:46PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > HMM wants to remove device memory early before device tear down so add an
> > helper to do that.
> 
> Could you please explain why HMM wants to remove device memory before
> device tear down ?
> 

Some device drivers want to manage memory for several physical devices
from a single fake device driver, because it fits their driver
architecture better and those physical devices can have dedicated links
between them.

The issue is that the fake device driver can outlive any of the real
devices for a long time, so we want to be able to remove device memory
before the fake device goes away in order to free up resources early.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-21 10:58   ` Anshuman Khandual
@ 2016-11-21 12:42     ` Jerome Glisse
  2016-11-22  4:48       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:42 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 04:28:04PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > To allow use of device un-addressable memory inside a process add a
> > special swap type. Also add a new callback to handle page fault on
> > such entry.
> 
> IIUC this swap type is required only for the mirror cases and its
> not a requirement for migration. If it's required for mirroring
> purpose where we intercept each page fault, the commit message
> here should clearly elaborate on that more.

It is only required for un-addressable memory. The mirroring has nothing
to do with it. I will clarify the commit message.

[...]

> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index b6f03e9..d584c74 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
> >   */
> >  struct dev_pagemap {
> >  	void (*free_devpage)(struct page *page, void *data);
> > +	int (*fault)(struct vm_area_struct *vma,
> > +		     unsigned long addr,
> > +		     struct page *page,
> > +		     unsigned flags,
> > +		     pmd_t *pmdp);
> 
> We are extending the dev_pagemap once again to accommodate device driver
> specific fault routines for these pages. Wondering if this extension and
> the new swap type should be in the same patch.

It makes sense to have it in one single patch as I also change the page
fault code path to deal with the new special swap entry, and that code
makes use of this new callback.


> > +int device_entry_fault(struct vm_area_struct *vma,
> > +		       unsigned long addr,
> > +		       swp_entry_t entry,
> > +		       unsigned flags,
> > +		       pmd_t *pmdp)
> > +{
> > +	struct page *page = device_entry_to_page(entry);
> > +
> 
> A BUG_ON() if page->pgmap->fault has not been populated by the driver.
> 

Ok

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
  2016-11-21  8:18   ` Anshuman Khandual
@ 2016-11-21 12:50     ` Jerome Glisse
  2016-11-22  4:30       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-21 12:50 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Mon, Nov 21, 2016 at 01:48:26PM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > Catch page from ZONE_DEVICE in free_hot_cold_page(). This should never
> > happen as ZONE_DEVICE page must always have an elevated refcount.
> > 
> > This is to catch refcounting issues in a sane way for ZONE_DEVICE pages.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  mm/page_alloc.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 0fbfead..09b2630 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2435,6 +2435,16 @@ void free_hot_cold_page(struct page *page, bool cold)
> >  	unsigned long pfn = page_to_pfn(page);
> >  	int migratetype;
> >  
> > +	/*
> > +	 * This should never happen ! Page from ZONE_DEVICE always must have an
> > +	 * active refcount. Complain about it and try to restore the refcount.
> > +	 */
> > +	if (is_zone_device_page(page)) {
> > +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
> > +		page_ref_inc(page);
> > +		return;
> > +	}
> 
> This fixes an issue in the existing ZONE_DEVICE code, should not this
> patch be sent separately not in this series ?
> 

Well, this is more like a safety-net feature; I can send it separately
from the series. It is not an issue per se, rather a trap to catch bugs.
I had refcounting bugs while working on this patchset and having this
safety net was helpful to quickly pinpoint issues.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-21  5:05     ` Jerome Glisse
@ 2016-11-22  2:19       ` Balbir Singh
  2016-11-22 13:59         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Balbir Singh @ 2016-11-22  2:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler



On 21/11/16 16:05, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 01:06:45PM +1100, Balbir Singh wrote:
>>
>>
>> On 19/11/16 05:18, Jérôme Glisse wrote:
>>> To allow use of device un-addressable memory inside a process add a
>>> special swap type. Also add a new callback to handle page fault on
>>> such entry.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> ---
>>>  fs/proc/task_mmu.c       | 10 +++++++-
>>>  include/linux/memremap.h |  5 ++++
>>>  include/linux/swap.h     | 18 ++++++++++---
>>>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>  kernel/memremap.c        | 14 ++++++++++
>>>  mm/Kconfig               | 12 +++++++++
>>>  mm/memory.c              | 24 +++++++++++++++++
>>>  mm/mprotect.c            | 12 +++++++++
>>>  8 files changed, 158 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 6909582..0726d39 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>>>  			} else {
>>>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
>>>  			}
>>> -		} else if (is_migration_entry(swpent))
>>> +		} else if (is_migration_entry(swpent)) {
>>>  			page = migration_entry_to_page(swpent);
>>> +		} else if (is_device_entry(swpent)) {
>>> +			page = device_entry_to_page(swpent);
>>> +		}
>>
>>
>> So the reason there is a device swap entry for a page belonging to a user process is
>> that it is in the middle of migration or is it always that a swap entry represents
>> unaddressable memory belonging to a GPU device, but its tracked in the page table
>> entries of the process.
> 
> For page being migrated i use the existing special migration pte entry. This new device
> special swap entry is only for unaddressable memory belonging to a device (GPU or any
> else). We need to keep track of those inside the CPU page table. Using a new special
> swap entry is the easiest way with the minimum amount of change to core mm.
> 

Thanks, makes sense

> [...]
> 
>>> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
>>> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
>>> +{
>>> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
>>
>> Code style checks
> 
> I was trying to balance against 79 columns break rule :)
> 
> [...]
> 
>>> +		} else if (is_device_entry(entry)) {
>>> +			page = device_entry_to_page(entry);
>>> +
>>> +			get_page(page);
>>> +			rss[mm_counter(page)]++;
>>
>> Why does rss count go up?
> 
> I wanted the device page to be treated like any other page. There is an argument
> to be made against and for doing that. Do you have strong argument for not doing
> this ?
> 

Yes, it will end up confusing rss accounting IMHO. If a task is using a
lot of pages on the GPU, should it be a candidate for OOM based on its
RSS, for example?

> [...]
> 
>>> @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
>>>  	if (unlikely(non_swap_entry(entry))) {
>>>  		if (is_migration_entry(entry)) {
>>>  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
>>> +		} else if (is_device_entry(entry)) {
>>> +			ret = device_entry_fault(vma, fe->address, entry,
>>> +						 fe->flags, fe->pmd);
>>
>> What does device_entry_fault() actually do here?
> 
> Well it is a special fault handler, it must migrate the memory back to some place
> where the CPU can access it. It only matter for unaddressable memory.

So effectively swap the page back in; chances are it can ping-pong... but
I was wondering if we can tell the GPU that the CPU is accessing these
pages as well. I presume any operation that causes memory access - a core
dump, for example - will swap things back in from the HMM side onto the
CPU side.

> 
>>>  		} else if (is_hwpoison_entry(entry)) {
>>>  			ret = VM_FAULT_HWPOISON;
>>>  		} else {
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 1bc1eb3..70aff3a 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>>  
>>>  				pages++;
>>>  			}
>>> +
>>> +			if (is_write_device_entry(entry)) {
>>> +				pte_t newpte;
>>> +
>>> +				make_device_entry_read(&entry);
>>> +				newpte = swp_entry_to_pte(entry);
>>> +				if (pte_swp_soft_dirty(oldpte))
>>> +					newpte = pte_swp_mksoft_dirty(newpte);
>>> +				set_pte_at(mm, addr, pte, newpte);
>>> +
>>> +				pages++;
>>> +			}
>>
>> Does it make sense to call mprotect() on device memory ranges?
> 
> There is nothing special about vma that containt device memory. They can be
> private anonymous, share, file back ... So any existing memory syscall must
> behave as expected. This is really just like any other page except that CPU
> can not access it.

I understand that, but what would marking it as R/O while the GPU is in
the middle of a write mean? I would also worry about passing "executable"
pages over to the other side.

Balbir Singh.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages
  2016-11-21 12:50     ` Jerome Glisse
@ 2016-11-22  4:30       ` Anshuman Khandual
  0 siblings, 0 replies; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  4:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 06:20 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 01:48:26PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>>> Catch page from ZONE_DEVICE in free_hot_cold_page(). This should never
>>> happen as ZONE_DEVICE page must always have an elevated refcount.
>>>
>>> This is to catch refcounting issues in a sane way for ZONE_DEVICE pages.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> ---
>>>  mm/page_alloc.c | 10 ++++++++++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 0fbfead..09b2630 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2435,6 +2435,16 @@ void free_hot_cold_page(struct page *page, bool cold)
>>>  	unsigned long pfn = page_to_pfn(page);
>>>  	int migratetype;
>>>  
>>> +	/*
>>> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
>>> +	 * active refcount. Complain about it and try to restore the refcount.
>>> +	 */
>>> +	if (is_zone_device_page(page)) {
>>> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
>>> +		page_ref_inc(page);
>>> +		return;
>>> +	}
>>
>> This fixes an issue in the existing ZONE_DEVICE code, should not this
>> patch be sent separately not in this series ?
>>
> 
> Well this is more like a safetynet feature, i can send it separately from the
> series. It is not an issue per say as a trap to catch bugs. I had refcounting
> bugs while working on this patchset and having this safetynet was helpful to
> quickly pin-point issues.

Sure, at the least move them up in the series as ZONE_DEVICE preparatory
fixes, before expanding the ZONE_DEVICE framework to accommodate the new
un-addressable memory representation.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-21 12:42     ` Jerome Glisse
@ 2016-11-22  4:48       ` Anshuman Khandual
  2016-11-24 13:56         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  4:48 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 06:12 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 04:28:04PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>>> To allow use of device un-addressable memory inside a process add a
>>> special swap type. Also add a new callback to handle page fault on
>>> such entry.
>>
>> IIUC this swap type is required only for the mirror cases and its
>> not a requirement for migration. If it's required for mirroring
>> purpose where we intercept each page fault, the commit message
>> here should clearly elaborate on that more.
> 
> It is only require for un-addressable memory. The mirroring has nothing to do
> with it. I will clarify commit message.

One thing though. I don't recall how persistent memory ZONE_DEVICE pages
are handled inside the page tables; my point here is that it should be
part of the same code block. We should catch that it is a device memory
page, then figure out whether it is addressable or not and act
accordingly. Because persistent memory is CPU addressable there might not
be a special code block today, but dealing with device pages should be
handled in a more holistic manner.
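
Something like the following helper (purely illustrative, built on the
device_entry_to_page()/is_addressable_page() helpers from this series) is
what I mean by deciding it in one place:

	/* Illustrative only: classify a ZONE_DEVICE swap entry in one place. */
	static bool device_entry_needs_device_fault(swp_entry_t entry)
	{
		struct page *page = device_entry_to_page(entry);

		/*
		 * CPU addressable device memory (e.g. persistent memory) can
		 * be handled like regular memory; only un-addressable device
		 * memory has to go through the driver fault callback.
		 */
		return !is_addressable_page(page);
	}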

> 
> [...]
> 
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index b6f03e9..d584c74 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -47,6 +47,11 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>>>   */
>>>  struct dev_pagemap {
>>>  	void (*free_devpage)(struct page *page, void *data);
>>> +	int (*fault)(struct vm_area_struct *vma,
>>> +		     unsigned long addr,
>>> +		     struct page *page,
>>> +		     unsigned flags,
>>> +		     pmd_t *pmdp);
>>
>> We are extending the dev_pagemap once again to accommodate device driver
>> specific fault routines for these pages. Wondering if this extension and
>> the new swap type should be in the same patch.
> 
> It make sense to have it in one single patch as i also change page fault code
> path to deal with the new special swap entry and those make use of this new
> callback.
> 

Okay.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2016-11-21 12:39     ` Jerome Glisse
@ 2016-11-22  4:54       ` Anshuman Khandual
  0 siblings, 0 replies; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  4:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 06:09 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 04:07:46PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>>> HMM wants to remove device memory early before device tear down so add an
>>> helper to do that.
>>
>> Could you please explain why HMM wants to remove device memory before
>> device tear down ?
>>
> 
> Some device driver want to manage memory for several physical devices from a
> single fake device driver. Because it fits their driver architecture better
> and those physical devices can have dedicated link between them.
> 
> Issue is that the fake device driver can outlive any of the real device for a
> long time so we want to be able to remove device memory before the fake device
> goes away to free up resources early.

Got it.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-11-21 12:34     ` Jerome Glisse
@ 2016-11-22  5:02       ` Anshuman Khandual
  0 siblings, 0 replies; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  5:02 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 06:04 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 01:56:02PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>>> When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
>>> is holding a reference on it (only device to which the memory belong do).
>>> Add a callback and call it when that happen so device driver can implement
>>> their own free page management.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> ---
>>>  include/linux/memremap.h | 4 ++++
>>>  kernel/memremap.c        | 8 ++++++++
>>>  2 files changed, 12 insertions(+)
>>>
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index fe61dca..469c88d 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -37,17 +37,21 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>>>  
>>>  /**
>>>   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
>>> + * @free_devpage: free page callback when page refcount reach 1
>>>   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>>>   * @res: physical address range covered by @ref
>>>   * @ref: reference count that pins the devm_memremap_pages() mapping
>>>   * @dev: host device of the mapping for debug
>>> + * @data: privata data pointer for free_devpage
>>>   * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
>>>   */
>>>  struct dev_pagemap {
>>> +	void (*free_devpage)(struct page *page, void *data);
>>>  	struct vmem_altmap *altmap;
>>>  	const struct resource *res;
>>>  	struct percpu_ref *ref;
>>>  	struct device *dev;
>>> +	void *data;
>>>  	int flags;
>>>  };
>>>  
>>> diff --git a/kernel/memremap.c b/kernel/memremap.c
>>> index 438a73aa2..3d28048 100644
>>> --- a/kernel/memremap.c
>>> +++ b/kernel/memremap.c
>>> @@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
>>>  
>>>  void put_zone_device_page(struct page *page)
>>>  {
>>> +	/*
>>> +	 * If refcount is 1 then page is freed and refcount is stable as nobody
>>> +	 * holds a reference on the page.
>>> +	 */
>>> +	if (page->pgmap->free_devpage && page_count(page) == 1)
>>> +		page->pgmap->free_devpage(page, page->pgmap->data);
>>>  	put_dev_pagemap(page->pgmap);
>>>  }
>>>  EXPORT_SYMBOL(put_zone_device_page);
>>> @@ -326,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>>>  	pgmap->ref = ref;
>>>  	pgmap->res = &page_map->res;
>>>  	pgmap->flags = flags | MEMORY_DEVICE;
>>> +	pgmap->free_devpage = NULL;
>>> +	pgmap->data = NULL;
>>
>> When is the driver expected to load up pgmap->free_devpage ? I thought
>> this function is one of the right places. Though as all the pages in
>> the same hotplug operation point to the same dev_pagemap structure this
>> loading can be done at later point of time as well.
>>
> 
> I wanted to avoid adding more argument to devm_memremap_pages() as it already
> has a long list. Hence why i let the caller set those afterward.

IMHO we should still pass it through a function argument, so that by the
time the function returns we will have device memory properly set up
through ZONE_DEVICE with all bells and whistles enabled.
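
That is, something along these lines (just a sketch of the alternative
prototype, not tested):

	void *devm_memremap_pages(struct device *dev, struct resource *res,
				  struct percpu_ref *ref,
				  struct vmem_altmap *altmap,
				  struct dev_pagemap **ppgmap, int flags,
				  void (*free_devpage)(struct page *page,
						       void *data),
				  void *data);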

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-11-21 12:33     ` Jerome Glisse
@ 2016-11-22  5:15       ` Anshuman Khandual
  0 siblings, 0 replies; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  5:15 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 11/21/2016 06:03 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 01:36:57PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>>> This add support for un-addressable device memory. Such memory is hotpluged
>>> only so we can have struct page but should never be map. This patch add code
>>
>> struct pages inside the system RAM range unlike the vmem_altmap scheme
>> where the struct pages can be inside the device memory itself. This
>> possibility does not arise for un addressable device memory. May be we
>> will have to block the paths where vmem_altmap is requested along with
>> un addressable device memory.
> 
> I did not think checking for that explicitly was necessary, sounded like shooting
> yourself in the foot and that it would be obvious :)

devm_memremap_pages() is kind of an important interface for getting
device memory into the kernel through ZONE_DEVICE, so it should actually
enforce all these checks. Also, we should document these things clearly
above the function.

> 
> [...]
> 
>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>>> index 9341619..fe61dca 100644
>>> --- a/include/linux/memremap.h
>>> +++ b/include/linux/memremap.h
>>> @@ -41,22 +41,34 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>>>   * @res: physical address range covered by @ref
>>>   * @ref: reference count that pins the devm_memremap_pages() mapping
>>>   * @dev: host device of the mapping for debug
>>> + * @flags: memory flags (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
>>
>> ^^^^^^^^^^^^^ device memory flags instead ?
> 
> Well maybe it will be use for something else than device memory in the future
> but yes for now it is only device memory so i can rename it.
> 
>>>   */
>>>  struct dev_pagemap {
>>>  	struct vmem_altmap *altmap;
>>>  	const struct resource *res;
>>>  	struct percpu_ref *ref;
>>>  	struct device *dev;
>>> +	int flags;
>>>  };
>>>  
>>>  #ifdef CONFIG_ZONE_DEVICE
>>>  void *devm_memremap_pages(struct device *dev, struct resource *res,
>>> -		struct percpu_ref *ref, struct vmem_altmap *altmap);
>>> +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
>>> +			  struct dev_pagemap **ppgmap, int flags);
>>>  struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
>>> +
>>> +static inline bool is_addressable_page(const struct page *page)
>>> +{
>>> +	return ((page_zonenum(page) != ZONE_DEVICE) ||
>>> +		!(page->pgmap->flags & MEMORY_UNADDRESSABLE));
>>> +}
>>>  #else
>>>  static inline void *devm_memremap_pages(struct device *dev,
>>> -		struct resource *res, struct percpu_ref *ref,
>>> -		struct vmem_altmap *altmap)
>>> +					struct resource *res,
>>> +					struct percpu_ref *ref,
>>> +					struct vmem_altmap *altmap,
>>> +					struct dev_pagemap **ppgmap,
>>> +					int flags)
>>
>>
>> As I had mentioned before devm_memremap_pages() should be changed not
>> to accept a valid altmap along with request for un-addressable memory.
> 
> If you fear such case yes sure.
> 
> 
> [...]
> 
>>> diff --git a/kernel/memremap.c b/kernel/memremap.c
>>> index 07665eb..438a73aa2 100644
>>> --- a/kernel/memremap.c
>>> +++ b/kernel/memremap.c
>>> @@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
>>>  	/* pages are dead and unused, undo the arch mapping */
>>>  	align_start = res->start & ~(SECTION_SIZE - 1);
>>>  	align_size = ALIGN(resource_size(res), SECTION_SIZE);
>>> -	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
>>> +	arch_remove_memory(align_start, align_size, pgmap->flags);
>>>  	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
>>>  	pgmap_radix_release(res);
>>>  	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
>>> @@ -270,6 +270,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>>>   * @res: "host memory" address range
>>>   * @ref: a live per-cpu reference count
>>>   * @altmap: optional descriptor for allocating the memmap from @res
>>> + * @ppgmap: pointer set to new page dev_pagemap on success
>>> + * @flags: flag for memory (look for MEMORY_FLAGS_NONE in memory_hotplug.h)
>>>   *
>>>   * Notes:
>>>   * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
>>> @@ -280,7 +282,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>>>   *    this is not enforced.
>>>   */
>>>  void *devm_memremap_pages(struct device *dev, struct resource *res,
>>> -		struct percpu_ref *ref, struct vmem_altmap *altmap)
>>> +			  struct percpu_ref *ref, struct vmem_altmap *altmap,
>>> +			  struct dev_pagemap **ppgmap, int flags)
>>>  {
>>>  	resource_size_t key, align_start, align_size, align_end;
>>>  	pgprot_t pgprot = PAGE_KERNEL;
>>> @@ -322,6 +325,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>>>  	}
>>>  	pgmap->ref = ref;
>>>  	pgmap->res = &page_map->res;
>>> +	pgmap->flags = flags | MEMORY_DEVICE;
>>
>> So the caller of devm_memremap_pages() should not have give out MEMORY_DEVICE
>> in the flag it passed on to this function ? Hmm, else we should just check
>> that the flags contains all appropriate bits before proceeding.
> 
> Here i was just trying to be on the safe side, yes caller should already have set
> the flag but this function is only use for device memory so it did not seem like
> it would hurt to be extra safe. I can add a BUG_ON() but it seems people have mix
> feeling about BUG_ON()

We don't have to do BUG_ON(), just a check that all expected flags are in
there, else fail the call. Now, this function does not return any value
to be checked inside the driver; in that case we can just print an error
message and move on.
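
For example, something like this near the top of devm_memremap_pages()
(only illustrative):

	/*
	 * Illustrative only: complain if the caller forgot MEMORY_DEVICE,
	 * then fix the flags up and keep going.
	 */
	if (!(flags & MEMORY_DEVICE)) {
		dev_err(dev, "%s: caller did not set MEMORY_DEVICE\n", __func__);
		flags |= MEMORY_DEVICE;
	}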

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-21 12:27     ` Jerome Glisse
@ 2016-11-22  5:35       ` Anshuman Khandual
  2016-11-22 14:08         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-22  5:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dan Williams

On 11/21/2016 05:57 PM, Jerome Glisse wrote:
> On Mon, Nov 21, 2016 at 12:11:50PM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> 
> [...]
> 
>>> @@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>>>  	remove_pagetable(start, end, true);
>>>  }
>>>  
>>> -int __ref arch_remove_memory(u64 start, u64 size)
>>> +int __ref arch_remove_memory(u64 start, u64 size, int flags)
>>>  {
>>>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>>>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>>> @@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
>>>  	struct zone *zone;
>>>  	int ret;
>>>  
>>> +	/* Need to add support for device and unaddressable memory if needed */
>>> +	if (flags & MEMORY_UNADDRESSABLE) {
>>> +		BUG();
>>> +		return -EINVAL;
>>> +	}
>>> +
>>>  	/* With altmap the first mapped page is offset from @start */
>>>  	altmap = to_vmem_altmap((unsigned long) page);
>>>  	if (altmap)
>>
>> So with this patch none of the architectures support un-addressable
>> memory but then support will be added through later patches ?
>> zone_for_memory function's flag now takes MEMORY_DEVICE parameter.
>> Then we need to change all the previous ZONE_DEVICE changes which
>> ever took "for_device" to accommodate this new flag ? just curious.
> 
> Yes correct.
> 
> 
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index 01033fa..ba9b12e 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -103,7 +103,7 @@ extern bool memhp_auto_online;
>>>  
>>>  #ifdef CONFIG_MEMORY_HOTREMOVE
>>>  extern bool is_pageblock_removable_nolock(struct page *page);
>>> -extern int arch_remove_memory(u64 start, u64 size);
>>> +extern int arch_remove_memory(u64 start, u64 size, int flags);
>>>  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
>>>  	unsigned long nr_pages);
>>>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>>> @@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
>>>  extern int add_memory_resource(int nid, struct resource *resource, bool online);
>>>  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>>>  		bool for_device);
>>> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
>>> +
>>> +/*
>>> + * For device memory we want more informations than just knowing it is device
>>> + * memory. We want to know if we can migrate it (ie it is not storage memory
>>> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
>>> + * memory can not be access by CPU but we still want struct page so that we
>>> + * can use it like regular memory.
>>
>> Some typos here. Needs to be cleaned up as well. But please have a
>> look at comment below over the classification itself.
>>
>>> + */
>>> +#define MEMORY_FLAGS_NONE 0
>>> +#define MEMORY_DEVICE (1 << 0)
>>> +#define MEMORY_MOVABLE (1 << 1)
>>> +#define MEMORY_UNADDRESSABLE (1 << 2)
>>
>> It should be DEVICE_MEMORY_* instead of MEMORY_* as we are trying to
>> classify device memory (though they are represented with struct page)
>> not regular system ram memory. This should attempt to classify device
>> memory which is backed by struct pages. arch_add_memory/arch_remove
>> _memory does not come into play if it's traditional device memory
>> which is just PFN and does not have struct page associated with it.
> 
> Good idea i will change that.
> 
> 
>> Broadly they are either CPU accessible or in-accessible. Storage
>> memory like persistent memory represented though ZONE_DEVICE fall
>> under the accessible (coherent) category. IIUC right now they are
>> not movable because page->pgmap replaces page->lru in struct page
>> hence its inability to be on standard LRU lists as one of the
>> reasons. As there was a need to have struct page to exploit more
>> core VM features on these memory going forward it will have to be
>> migratable one way or the other to accommodate features like
>> compaction, HW poison etc in these storage memory. Hence my point
>> here is lets not classify any of these memories as non-movable.
>> Just addressable or not should be the only classification.
> 
> Being on the lru or not is not and issue in respect to migration. Being

Right, provided we create separate migration interfaces for these non-LRU
pages (preferably through the HMM migration API layer). But where it
stands today, device non-LRU memory is a problem for the NUMA
migrate_pages() interface and we cannot use it for migration. Hence I
brought up the non-LRU issue here.

> on the lru was use as an indication that the page is manage through the
> standard mm code and thus that many assumptions hold which in turn do
> allow migration. But if one use device memory following all rules of
> regular memory then migration can be done to no matter if page is on
> lru or not.

Right.

> 
> I still think that the MOVABLE is an important distinction as i am pretty
> sure that the persistent folks do not want to see their page migrated in
> anyway. I might rename it to DEVICE_MEMORY_ALLOW_MIGRATION.

We should not classify memory based on whether there is a *requirement*
for migration or not at this point in time; the classification should be
based on whether it is inherently migratable or not. I don't see any
reason why persistent memory cannot be migrated. I am not very familiar
with the DAX file system and its use of persistent memory, but I would
guess that their requirements for compaction and error handling are met
way above, in the file system layers, hence they never needed this
support at the struct page level. I am just guessing.

I have added Dan J Williams to this thread; he might be able to give us
some more details regarding persistent memory migration requirements and
their current state.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-22  2:19       ` Balbir Singh
@ 2016-11-22 13:59         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-22 13:59 UTC (permalink / raw)
  To: Balbir Singh
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Tue, Nov 22, 2016 at 01:19:42PM +1100, Balbir Singh wrote:
> 
> 
> On 21/11/16 16:05, Jerome Glisse wrote:
> > On Mon, Nov 21, 2016 at 01:06:45PM +1100, Balbir Singh wrote:
> >>
> >>
> >> On 19/11/16 05:18, Jérôme Glisse wrote:
> >>> To allow use of device un-addressable memory inside a process add a
> >>> special swap type. Also add a new callback to handle page fault on
> >>> such entry.
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>> ---
> >>>  fs/proc/task_mmu.c       | 10 +++++++-
> >>>  include/linux/memremap.h |  5 ++++
> >>>  include/linux/swap.h     | 18 ++++++++++---
> >>>  include/linux/swapops.h  | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  kernel/memremap.c        | 14 ++++++++++
> >>>  mm/Kconfig               | 12 +++++++++
> >>>  mm/memory.c              | 24 +++++++++++++++++
> >>>  mm/mprotect.c            | 12 +++++++++
> >>>  8 files changed, 158 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>> index 6909582..0726d39 100644
> >>> --- a/fs/proc/task_mmu.c
> >>> +++ b/fs/proc/task_mmu.c
> >>> @@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
> >>>  			} else {
> >>>  				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
> >>>  			}
> >>> -		} else if (is_migration_entry(swpent))
> >>> +		} else if (is_migration_entry(swpent)) {
> >>>  			page = migration_entry_to_page(swpent);
> >>> +		} else if (is_device_entry(swpent)) {
> >>> +			page = device_entry_to_page(swpent);
> >>> +		}
> >>
> >>
> >> So the reason there is a device swap entry for a page belonging to a user process is
> >> that it is in the middle of migration or is it always that a swap entry represents
> >> unaddressable memory belonging to a GPU device, but its tracked in the page table
> >> entries of the process.
> > 
> > For page being migrated i use the existing special migration pte entry. This new device
> > special swap entry is only for unaddressable memory belonging to a device (GPU or any
> > else). We need to keep track of those inside the CPU page table. Using a new special
> > swap entry is the easiest way with the minimum amount of change to core mm.
> > 
> 
> Thanks, makes sense
> 
> > [...]
> > 
> >>> +#ifdef CONFIG_DEVICE_UNADDRESSABLE
> >>> +static inline swp_entry_t make_device_entry(struct page *page, bool write)
> >>> +{
> >>> +	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
> >>
> >> Code style checks
> > 
> > I was trying to balance against 79 columns break rule :)
> > 
> > [...]
> > 
> >>> +		} else if (is_device_entry(entry)) {
> >>> +			page = device_entry_to_page(entry);
> >>> +
> >>> +			get_page(page);
> >>> +			rss[mm_counter(page)]++;
> >>
> >> Why does rss count go up?
> > 
> > I wanted the device page to be treated like any other page. There is an argument
> > to be made against and for doing that. Do you have strong argument for not doing
> > this ?
> > 
> 
> Yes, It will end up confusing rss accounting IMHO. If a task is using a lot of
> pages on the GPU, should be it a candidate for OOM based on it's RSS for example?
> 
> > [...]
> > 
> >>> @@ -2536,6 +2557,9 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
> >>>  	if (unlikely(non_swap_entry(entry))) {
> >>>  		if (is_migration_entry(entry)) {
> >>>  			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
> >>> +		} else if (is_device_entry(entry)) {
> >>> +			ret = device_entry_fault(vma, fe->address, entry,
> >>> +						 fe->flags, fe->pmd);
> >>
> >> What does device_entry_fault() actually do here?
> > 
> > Well it is a special fault handler, it must migrate the memory back to some place
> > where the CPU can access it. It only matter for unaddressable memory.
> 
> So effectively swap the page back in, chances are it can ping pong ...but I was wondering if we can
> tell the GPU that the CPU is accessing these pages as well. I presume any operation that causes
> memory access - core dump will swap back in things from the HMM side onto the CPU side.

Well, it is up to the device driver to gather statistics on what can and
should be inside device memory. My expectation is that they will detect
ping-pong and stop asking to migrate a given address/range to device
memory.

> 
> > 
> >>>  		} else if (is_hwpoison_entry(entry)) {
> >>>  			ret = VM_FAULT_HWPOISON;
> >>>  		} else {
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 1bc1eb3..70aff3a 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >>>  
> >>>  				pages++;
> >>>  			}
> >>> +
> >>> +			if (is_write_device_entry(entry)) {
> >>> +				pte_t newpte;
> >>> +
> >>> +				make_device_entry_read(&entry);
> >>> +				newpte = swp_entry_to_pte(entry);
> >>> +				if (pte_swp_soft_dirty(oldpte))
> >>> +					newpte = pte_swp_mksoft_dirty(newpte);
> >>> +				set_pte_at(mm, addr, pte, newpte);
> >>> +
> >>> +				pages++;
> >>> +			}
> >>
> >> Does it make sense to call mprotect() on device memory ranges?
> > 
> > There is nothing special about vma that containt device memory. They can be
> > private anonymous, share, file back ... So any existing memory syscall must
> > behave as expected. This is really just like any other page except that CPU
> > can not access it.
> 
> I understand that, but what would marking it as R/O when the GPU is in the middle
> of write mean? I would also worry about passing "executable" pages over to the
> other side.
> 

Any memory protection change will trigger mmu_notifier calls, which in
turn will update the device page table accordingly. So the R/O status
will also apply on the GPU.

We assume here that the device driver is not doing evil things and that
it obeys memory protection for all ranges it mirrors. Upstream drivers
are easy to check. Closed drivers might be more problematic; in the
NVidia case this part is open source and is easily checkable.
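
To make that concrete, the driver side looks roughly like the sketch
below (the structure and helper names are made up; only the mmu_notifier
callback itself is an existing kernel interface):

	static void mydrv_invalidate_range_start(struct mmu_notifier *mn,
						 struct mm_struct *mm,
						 unsigned long start,
						 unsigned long end)
	{
		/* Hypothetical per-mirror state embedding the notifier. */
		struct mydrv_mirror *mirror;

		mirror = container_of(mn, struct mydrv_mirror, notifier);
		/*
		 * Drop or downgrade the device page table entries covering
		 * [start, end) so the GPU honors the new CPU protections.
		 */
		mydrv_update_device_ptes(mirror, start, end);
	}

	static const struct mmu_notifier_ops mydrv_mmu_notifier_ops = {
		.invalidate_range_start	= mydrv_invalidate_range_start,
	};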

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags
  2016-11-22  5:35       ` Anshuman Khandual
@ 2016-11-22 14:08         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-22 14:08 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dan Williams

On Tue, Nov 22, 2016 at 11:05:30AM +0530, Anshuman Khandual wrote:
> On 11/21/2016 05:57 PM, Jerome Glisse wrote:
> > On Mon, Nov 21, 2016 at 12:11:50PM +0530, Anshuman Khandual wrote:
> >> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > 
> > [...]
> > 
> >>> @@ -956,7 +963,7 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> >>>  	remove_pagetable(start, end, true);
> >>>  }
> >>>  
> >>> -int __ref arch_remove_memory(u64 start, u64 size)
> >>> +int __ref arch_remove_memory(u64 start, u64 size, int flags)
> >>>  {
> >>>  	unsigned long start_pfn = start >> PAGE_SHIFT;
> >>>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> >>> @@ -965,6 +972,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
> >>>  	struct zone *zone;
> >>>  	int ret;
> >>>  
> >>> +	/* Need to add support for device and unaddressable memory if needed */
> >>> +	if (flags & MEMORY_UNADDRESSABLE) {
> >>> +		BUG();
> >>> +		return -EINVAL;
> >>> +	}
> >>> +
> >>>  	/* With altmap the first mapped page is offset from @start */
> >>>  	altmap = to_vmem_altmap((unsigned long) page);
> >>>  	if (altmap)
> >>
> >> So with this patch none of the architectures support un-addressable
> >> memory but then support will be added through later patches ?
> >> zone_for_memory function's flag now takes MEMORY_DEVICE parameter.
> >> Then we need to change all the previous ZONE_DEVICE changes which
> >> ever took "for_device" to accommodate this new flag ? just curious.
> > 
> > Yes correct.
> > 
> > 
> >>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> >>> index 01033fa..ba9b12e 100644
> >>> --- a/include/linux/memory_hotplug.h
> >>> +++ b/include/linux/memory_hotplug.h
> >>> @@ -103,7 +103,7 @@ extern bool memhp_auto_online;
> >>>  
> >>>  #ifdef CONFIG_MEMORY_HOTREMOVE
> >>>  extern bool is_pageblock_removable_nolock(struct page *page);
> >>> -extern int arch_remove_memory(u64 start, u64 size);
> >>> +extern int arch_remove_memory(u64 start, u64 size, int flags);
> >>>  extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
> >>>  	unsigned long nr_pages);
> >>>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> >>> @@ -275,7 +275,20 @@ extern int add_memory(int nid, u64 start, u64 size);
> >>>  extern int add_memory_resource(int nid, struct resource *resource, bool online);
> >>>  extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
> >>>  		bool for_device);
> >>> -extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
> >>> +
> >>> +/*
> >>> + * For device memory we want more informations than just knowing it is device
> >>> + * memory. We want to know if we can migrate it (ie it is not storage memory
> >>> + * use by DAX). Is it addressable by the CPU ? Some device memory like GPU
> >>> + * memory can not be access by CPU but we still want struct page so that we
> >>> + * can use it like regular memory.
> >>
> >> Some typos here. Needs to be cleaned up as well. But please have a
> >> look at comment below over the classification itself.
> >>
> >>> + */
> >>> +#define MEMORY_FLAGS_NONE 0
> >>> +#define MEMORY_DEVICE (1 << 0)
> >>> +#define MEMORY_MOVABLE (1 << 1)
> >>> +#define MEMORY_UNADDRESSABLE (1 << 2)
> >>
> >> It should be DEVICE_MEMORY_* instead of MEMORY_* as we are trying to
> >> classify device memory (though they are represented with struct page)
> >> not regular system ram memory. This should attempt to classify device
> >> memory which is backed by struct pages. arch_add_memory/arch_remove
> >> _memory does not come into play if it's traditional device memory
> >> which is just PFN and does not have struct page associated with it.
> > 
> > Good idea i will change that.
> > 
> > 
> >> Broadly they are either CPU accessible or in-accessible. Storage
> >> memory like persistent memory represented though ZONE_DEVICE fall
> >> under the accessible (coherent) category. IIUC right now they are
> >> not movable because page->pgmap replaces page->lru in struct page
> >> hence its inability to be on standard LRU lists as one of the
> >> reasons. As there was a need to have struct page to exploit more
> >> core VM features on these memory going forward it will have to be
> >> migratable one way or the other to accommodate features like
> >> compaction, HW poison etc in these storage memory. Hence my point
> >> here is lets not classify any of these memories as non-movable.
> >> Just addressable or not should be the only classification.
> > 
> > Being on the lru or not is not and issue in respect to migration. Being
> 
> Right, provided we we create separate migration interfaces for these non
> LRU pages (preferably through HMM migration API layer). But where it
> stands today, for NUMA migrate_pages() interface device non LRU memory
> is a problem and we cannot use it for migration. Hence I brought up the
> non LRU issue here.
> 
> > on the lru was use as an indication that the page is manage through the
> > standard mm code and thus that many assumptions hold which in turn do
> > allow migration. But if one use device memory following all rules of
> > regular memory then migration can be done to no matter if page is on
> > lru or not.
> 
> Right.
> 
> > 
> > I still think that the MOVABLE is an important distinction as i am pretty
> > sure that the persistent folks do not want to see their page migrated in
> > anyway. I might rename it to DEVICE_MEMORY_ALLOW_MIGRATION.
> 
> We should not classify memory based on whether there is a *requirement*
> for migration or not at this point of time, the classification should
> be done if its inherently migratable or not. I dont see any reason why
> persistent memory cannot be migrated. I am not very familiar with DAX
> file system and its use of persistent memory but I would guess that
> their requirement for compaction and error handling happens way above
> in file system layers, hence they never needed these support at struct
> page level. I am just guessing.
> 
> Added Dan J Williams in this thread list, he might be able to give us
> some more details regarding persistent memory migration requirements
> and it's current state.
> 

Well my patches change even the existing migrate code and thus i need to make
sure i do not change behavior that the DAX/persistent memory folks rely on.
Hence i need a flag to allow or disallow device page migration. If that flag
proves useless later on it can be removed, but for now it is needed to
maintain the existing behavior.
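
As an illustration only, a minimal sketch of how such a flag could gate device
page migration (the DEVICE_MEMORY_* names follow the rename discussed above,
and the helper below is hypothetical, not part of the patchset):

    /* Hypothetical flags describing device memory backed by struct page. */
    #define DEVICE_MEMORY_UNADDRESSABLE   (1 << 0)
    #define DEVICE_MEMORY_ALLOW_MIGRATION (1 << 1)

    /*
     * Sketch: a migration path would skip ZONE_DEVICE pages whose driver did
     * not opt in, so DAX/persistent memory keeps its existing behavior.
     */
    static bool device_page_migratable(struct page *page, int flags)
    {
            if (!is_zone_device_page(page))
                    return true;    /* regular memory, existing rules apply */
            return flags & DEVICE_MEMORY_ALLOW_MIGRATION;
    }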

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
  2016-11-21  2:29   ` Balbir Singh
@ 2016-11-23  4:03   ` Anshuman Khandual
  2016-11-27 13:10     ` Jerome Glisse
  1 sibling, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-23  4:03 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jatin Kumar, Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> HMM provides 3 separate functionality :
>     - Mirroring: synchronize CPU page table and device page table
>     - Device memory: allocating struct page for device memory
>     - Migration: migrating regular memory to device memory
> 
> This patch introduces some common helpers and definitions to all of
> those 3 functionality.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jatin Kumar <jakumar@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
> Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
> Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
> ---
>  MAINTAINERS              |   7 +++
>  include/linux/hmm.h      | 139 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm_types.h |   5 ++
>  kernel/fork.c            |   2 +
>  mm/Kconfig               |  11 ++++
>  mm/Makefile              |   1 +
>  mm/hmm.c                 |  86 +++++++++++++++++++++++++++++
>  7 files changed, 251 insertions(+)
>  create mode 100644 include/linux/hmm.h
>  create mode 100644 mm/hmm.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f593300..41cd63d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5582,6 +5582,13 @@ S:	Supported
>  F:	drivers/scsi/hisi_sas/
>  F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
>  
> +HMM - Heterogeneous Memory Management
> +M:	Jérôme Glisse <jglisse@redhat.com>
> +L:	linux-mm@kvack.org
> +S:	Maintained
> +F:	mm/hmm*
> +F:	include/linux/hmm*
> +
>  HOST AP DRIVER
>  M:	Jouni Malinen <j@w1.fi>
>  L:	hostap@shmoo.com (subscribers-only)
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> new file mode 100644
> index 0000000..54dd529
> --- /dev/null
> +++ b/include/linux/hmm.h
> @@ -0,0 +1,139 @@
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * HMM provides 3 separate functionality :
> + *   - Mirroring: synchronize CPU page table and device page table
> + *   - Device memory: allocating struct page for device memory
> + *   - Migration: migrating regular memory to device memory
> + *
> + * Each can be use independently from the others.

Small nit s/use/used/

> + *
> + *
> + * Mirroring:
> + *
> + * HMM provide helpers to mirror process address space on a device. For this it
> + * provides several helpers to order device page table update in respect to CPU
> + * page table update. Requirement is that for any given virtual address the CPU
> + * and device page table can not point to different physical page. It uses the
> + * mmu_notifier API and introduce virtual address range lock which block CPU
> + * page table update for a range while the device page table is being updated.
> + * Usage pattern is:
> + *
> + *      hmm_vma_range_lock(vma, start, end);
> + *      // snap shot CPU page table
> + *      // update device page table from snapshot
> + *      hmm_vma_range_unlock(vma, start, end);

This code block could be explained in more detail.
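
For instance, a slightly more detailed sketch of the intended driver-side flow
(hmm_vma_get_pfns() is the snapshot helper from the later patches, but its
exact signature and the device-side update call are assumptions here):

    /* Sketch: mirror the CPU page table for [start, end) into a device. */
    static int dummy_mirror_range(struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end,
                                  hmm_pfn_t *pfns)
    {
            int ret;

            /* Block CPU page table updates for this range. */
            hmm_vma_range_lock(vma, start, end);

            /* Snapshot the CPU page table into the pfns array. */
            ret = hmm_vma_get_pfns(vma, start, end, pfns);
            if (!ret) {
                    /* Device specific: push the snapshot into the device
                     * page table (DMA mapping elided). */
                    ret = dummy_device_update_page_table(start, end, pfns);
            }

            /* Unblock CPU updates; later invalidations for this range are
             * delivered through the HMM mirror callbacks. */
            hmm_vma_range_unlock(vma, start, end);
            return ret;
    }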

> + *
> + * Any CPU page table update that conflict with a range lock will wait until
> + * range is unlock. This garanty proper serialization of CPU and device page
> + * table update.
> + *

Small typo in here      ^^^^^^^^^^^^

> + *
> + * Device memory:
> + *
> + * HMM provides helpers to help leverage device memory either addressable like
> + * regular memory by the CPU or un-addressable at all. In both case the device
> + * memory is associated to dedicated structs page (which are allocated like for
> + * hotplug memory). Device memory management is under the responsability of the

Typo in here                                               ^^^^^^^^^^^^^^^^

> + * device driver. HMM only allocate and initialize the struct pages associated
> + * with the device memory.

We should also mention that it is hot-plugged into the kernel as ZONE_DEVICE
based memory.

> + *
> + * Allocating struct page for device memory allow to use device memory allmost
> + * like any regular memory. Unlike regular memory it can not be added to the
> + * lru, nor can any memory allocation can use device memory directly. Device
> + * memory will only end up to be use in a process if device driver migrate some
> + * of the process memory from regular memory to device memory.
> + *
> + *
> + * Migration:
> + *
> + * Existing memory migration mechanism (mm/migrate.c) does not allow to use
> + * something else than the CPU to copy from source to destination memory. More
> + * over existing code is not tailor to drive migration from process virtual
> + * address rather than from list of pages. Finaly the migration flow does not
> + * allow for graceful failure at different step of the migration process.

The primary reason is that the migrate_pages() interface handles system
memory LRU pages and at this point cannot handle these new ZONE_DEVICE based
pages, whether they are addressable or not. IIUC the HMM migration API layer
intends to handle both LRU system RAM pages and non-LRU ZONE_DEVICE struct
pages and achieve the migration both ways. The API should also include a
struct page list based migration (like migrate_pages()) along with the
proposed virtual range based migration, so the driver can choose either
approach. Going forward this API layer should also include a migration
interface for addressable ZONE_DEVICE pages like persistent memory.

> + *
> + * HMM solves all of the above though simple API :

I guess you meant "through" instead of "though".

> + *
> + *      hmm_vma_migrate(vma, start, end, ops);
> + *
> + * With ops struct providing 2 callback alloc_and_copy() which allocated the
> + * destination memory and initialize it using source memory. Migration can fail
> + * after this step and thus last callback finalize_and_map() allow the device
> + * driver to know which page were successfully migrated and which were not.

So we have page->pgmap->free_devpage() to release an individual page back to
the device driver's management during migration, and we also have this ops
based finalize_and_map() to check on the failed instances inside a single
migration context, which can contain a set of pages at a time.
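
For reference, a rough sketch of what such an ops structure and its use could
look like (the struct name, field order and parameter types here are guesses
based on the description above, not the actual definition from the patch):

    /* Sketch: migration callbacks a device driver would provide. */
    struct dummy_hmm_migrate_ops {
            /* Allocate destination pages for the range and copy the data,
             * possibly using the device DMA engine. */
            void (*alloc_and_copy)(struct vm_area_struct *vma,
                                   const unsigned long *src_pfns,
                                   unsigned long *dst_pfns,
                                   unsigned long start,
                                   unsigned long end,
                                   void *private);
            /* Called after the CPU page tables are updated; reports which
             * pages were actually migrated so the driver can free the rest. */
            void (*finalize_and_map)(struct vm_area_struct *vma,
                                     const unsigned long *src_pfns,
                                     const unsigned long *dst_pfns,
                                     unsigned long start,
                                     unsigned long end,
                                     void *private);
    };

    /* Driver side: err = hmm_vma_migrate(vma, start, end, &dummy_ops); */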

> + *
> + * This can easily be use outside of HMM intended use case.

Where do you think this can be used outside of HMM?

> + *
> + *
> + * This header file contain all the API related to this 3 functionality and
> + * each functions and struct are more thouroughly documented in below comments.

Typo s/thouroughly/thoroughly/

> + */
> +#ifndef LINUX_HMM_H
> +#define LINUX_HMM_H
> +
> +#include <linux/kconfig.h>
> +
> +#if IS_ENABLED(CONFIG_HMM)
> +
> +
> +/*
> + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
> + *
> + * Flags:
> + * HMM_PFN_VALID: pfn is valid
> + * HMM_PFN_WRITE: CPU page table have the write permission set
> + */
> +typedef unsigned long hmm_pfn_t;
> +
> +#define HMM_PFN_VALID (1 << 0)
> +#define HMM_PFN_WRITE (1 << 1)
> +#define HMM_PFN_SHIFT 2
> +
> +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
> +{
> +	if (!(pfn & HMM_PFN_VALID))
> +		return NULL;
> +	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> +}
> +
> +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
> +{
> +	if (!(pfn & HMM_PFN_VALID))
> +		return -1UL;
> +	return (pfn >> HMM_PFN_SHIFT);
> +}
> +
> +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
> +{
> +	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +}
> +
> +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
> +{
> +	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> +}

Hmm, so if we use the last two bits of the PFN as flags, it does reduce the
number of bits available for the actual PFN range. But given that we support a
maximum of 64TB on POWER (not sure about X86) we can live with these two bits
going away from the unsigned long. But what is the purpose of tracking the
validity and write flags inside the PFN?

> +
> +
> +/* Below are for HMM internal use only ! Not to be use by device driver ! */

s/use/used/

> +void hmm_mm_destroy(struct mm_struct *mm);
> +
> +#else /* IS_ENABLED(CONFIG_HMM) */
> +
> +/* Below are for HMM internal use only ! Not to be use by device driver ! */


ditto

> +static inline void hmm_mm_destroy(struct mm_struct *mm) {}
> +
> +#endif /* IS_ENABLED(CONFIG_HMM) */
> +#endif /* LINUX_HMM_H */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4a8aced..4effdbf 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -23,6 +23,7 @@
>  
>  struct address_space;
>  struct mem_cgroup;
> +struct hmm;
>  
>  #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
>  #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
> @@ -516,6 +517,10 @@ struct mm_struct {
>  	atomic_long_t hugetlb_usage;
>  #endif
>  	struct work_struct async_put_work;
> +#if IS_ENABLED(CONFIG_HMM)
> +	/* HMM need to track few things per mm */
> +	struct hmm *hmm;
> +#endif
>  };

Hmm, so there is one HMM structure for each mm context.
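
Presumably (the allocation path is not visible in this hunk) the struct is set
up lazily the first time HMM is used for an mm, and torn down from __mmdrop()
via hmm_mm_destroy(). A simplified sketch of such lazy registration, ignoring
locking, and only a guess at what mm/hmm.c actually does:

    /* Sketch (assumed): allocate mm->hmm on first use. */
    static struct hmm *dummy_hmm_register(struct mm_struct *mm)
    {
            if (!mm->hmm) {
                    struct hmm *hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);

                    if (!hmm)
                            return NULL;
                    hmm->mm = mm;
                    mm->hmm = hmm;
            }
            return mm->hmm;
    }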

>  
>  static inline void mm_init_cpumask(struct mm_struct *mm)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 690a1aad..af0eec8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -27,6 +27,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/mman.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/hmm.h>
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/vmacache.h>
> @@ -702,6 +703,7 @@ void __mmdrop(struct mm_struct *mm)
>  	BUG_ON(mm == &init_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> +	hmm_mm_destroy(mm);
>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
>  	free_mm(mm);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0a21411..be18cc2 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -289,6 +289,17 @@ config MIGRATION
>  config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  	bool
>  
> +config HMM
> +	bool "Heterogeneous memory management (HMM)"
> +	depends on MMU
> +	default n
> +	help
> +	  Heterogeneous memory management, set of helpers for:
> +	    - mirroring of process address space on a device
> +	    - using device memory transparently inside a process
> +
> +	  If unsure, say N to disable HMM.
> +
>  config PHYS_ADDR_T_64BIT
>  	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
>  
> diff --git a/mm/Makefile b/mm/Makefile
> index 2ca1faf..6ac1284 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -76,6 +76,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_HMM) += hmm.o
>  obj-$(CONFIG_QUICKLIST) += quicklist.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/hmm.c b/mm/hmm.c
> new file mode 100644
> index 0000000..342b596
> --- /dev/null
> +++ b/mm/hmm.c
> @@ -0,0 +1,86 @@
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * Refer to include/linux/hmm.h for informations about heterogeneous memory

s/informations/information/

> + * management or HMM for short.
> + */
> +#include <linux/mm.h>
> +#include <linux/hmm.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +
> +/*
> + * struct hmm - HMM per mm struct
> + *
> + * @mm: mm struct this HMM struct is bound to
> + */
> +struct hmm {
> +	struct mm_struct	*mm;
> +};

So right now it is empty other than this link back to the struct mm.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
  2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
                   ` (18 preceding siblings ...)
  2016-11-19  0:41 ` [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 John Hubbard
@ 2016-11-23  9:16 ` Haggai Eran
  2016-11-25 16:16   ` Jerome Glisse
  19 siblings, 1 reply; 73+ messages in thread
From: Haggai Eran @ 2016-11-23  9:16 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Feras Daoud, Ilya Lesokhin, Liran Liss

On 11/18/2016 8:18 PM, Jérôme Glisse wrote:
> Cliff note: HMM offers 2 things (each standing on its own). First
> it allows to use device memory transparently inside any process
> without any modifications to process program code. Second it allows
> to mirror process address space on a device.
> 
> Change since v12 is the use of struct page for device memory even if
> the device memory is not accessible by the CPU (because of limitation
> impose by the bus between the CPU and the device).
> 
> Using struct page means that their are minimal changes to core mm
> code. HMM build on top of ZONE_DEVICE to provide struct page, it
> adds new features to ZONE_DEVICE. The first 7 patches implement
> those changes.
> 
> Rest of patchset is divided into 3 features that can each be use
> independently from one another. First is the process address space
> mirroring (patch 9 to 13), this allow to snapshot CPU page table
> and to keep the device page table synchronize with the CPU one.
> 
> Second is a new memory migration helper which allow migration of
> a range of virtual address of a process. This memory migration
> also allow device to use their own DMA engine to perform the copy
> between the source memory and destination memory. This can be
> usefull even outside HMM context in many usecase.
> 
> Third part of the patchset (patch 17-18) is a set of helper to
> register a ZONE_DEVICE node and manage it. It is meant as a
> convenient helper so that device drivers do not each have to
> reimplement over and over the same boiler plate code.
> 
> 
> I am hoping that this can now be consider for inclusion upstream.
> Bottom line is that without HMM we can not support some of the new
> hardware features on x86 PCIE. I do believe we need some solution
> to support those features or we won't be able to use such hardware
> in standard like C++17, OpenCL 3.0 and others.
> 
> I have been working with NVidia to bring up this feature on their
> Pascal GPU. There are real hardware that you can buy today that
> could benefit from HMM. We also intend to leverage this inside the
> open source nouveau driver.


Hi,

I think the way this new version of the patchset uses ZONE_DEVICE looks
promising and makes the patchset a little simpler than the previous
versions.

The mirroring code seems like it could be used to simplify the on-demand
paging code in the mlx5 driver and the RDMA subsystem. It currently uses
mmu notifiers directly.

I'm also curious whether it can be used to allow peer to peer access
between devices. For instance, if one device calls hmm_vma_get_pfns on a
process that has unaddressable memory mapped in, with some additional
help from DMA-API, its driver can convert these pfns to bus addresses
directed to another device's MMIO region and thus enable peer to peer
access. Then by handling invalidations through HMM's mirroring callbacks
it can safely handle cases where the peer migrates the page back to the
CPU or frees it.
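
To make the idea concrete, a very rough sketch of that flow from the peer
driver's side (only hmm_vma_get_pfns(), hmm_pfn_to_page() and the range lock
helpers come from this patchset, and even their signatures are assumed here;
dma_map_peer_resource() and peer_install_mapping() are purely hypothetical
placeholders for the DMA-API extension and the device-specific code):

    /* Sketch: peer device maps another device's unaddressable pages. */
    static int peer_map_range(struct device *peer, struct vm_area_struct *vma,
                              unsigned long start, unsigned long end,
                              hmm_pfn_t *pfns)
    {
            unsigned long addr;

            hmm_vma_range_lock(vma, start, end);
            hmm_vma_get_pfns(vma, start, end, pfns);

            for (addr = start; addr < end; addr += PAGE_SIZE) {
                    struct page *page;
                    dma_addr_t bus;

                    page = hmm_pfn_to_page(pfns[(addr - start) >> PAGE_SHIFT]);
                    if (!page || !is_zone_device_page(page))
                            continue;
                    /* Hypothetical DMA-API extension: translate the exporting
                     * device's BAR backing this page into a bus address. */
                    bus = dma_map_peer_resource(peer, page);
                    /* Hypothetical: program the peer device page table. */
                    peer_install_mapping(peer, addr, bus);
            }
            hmm_vma_range_unlock(vma, start, end);

            /* An HMM mirror invalidation callback covering this range must
             * tear the peer mapping down before the page migrates or is
             * freed. */
            return 0;
    }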

Haggai

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable
  2016-11-22  4:48       ` Anshuman Khandual
@ 2016-11-24 13:56         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-24 13:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Tue, Nov 22, 2016 at 10:18:27AM +0530, Anshuman Khandual wrote:
> On 11/21/2016 06:12 PM, Jerome Glisse wrote:
> > On Mon, Nov 21, 2016 at 04:28:04PM +0530, Anshuman Khandual wrote:
> >> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> >>> To allow use of device un-addressable memory inside a process add a
> >>> special swap type. Also add a new callback to handle page fault on
> >>> such entry.
> >>
> >> IIUC this swap type is required only for the mirror cases and its
> >> not a requirement for migration. If it's required for mirroring
> >> purpose where we intercept each page fault, the commit message
> >> here should clearly elaborate on that more.
> > 
> > It is only require for un-addressable memory. The mirroring has nothing to do
> > with it. I will clarify commit message.
> 
> One thing though. I dont recall how persistent memory ZONE_DEVICE
> pages are handled inside the page tables, point here is it should
> be part of the same code block. We should catch that its a device
> memory page and then figure out addressable or not and act
> accordingly. Because persistent memory are CPU addressable, there
> might not been special code block but dealing with device pages 
> should be handled in a more holistic manner.

Before i repost the updated patchset i should stress that dealing with
un-addressable device pages and addressable ones in the same block is not
doable without re-doing once again the whole mm page fault code path. Because
i use a special swap entry, the logical place for me to handle it is where
swap entries are handled.

Regular device pages are handled a bit more simply than other pages because
they can't be evicted/swapped, so they are always present once faulted. I
think right now they are always populated through the fs page fault callback
(well the dax one).

So there is not much reason to consolidate all device page handling in one
place. We are looking at different use cases in the end.
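
As a rough illustration of the first point above (the special swap entry being
handled where other swap entries are decoded), the fault path check could look
something like the following sketch; is_device_entry(), do_device_fault() and
the surrounding helper are assumed names, not necessarily what the patch uses:

    /* Sketch: called from the fault path where swap entries are decoded. */
    static int dummy_handle_swap_pte(struct vm_area_struct *vma,
                                     unsigned long address, pte_t pte)
    {
            swp_entry_t entry = pte_to_swp_entry(pte);

            if (is_device_entry(entry))
                    /* Un-addressable device memory: hand the fault to the
                     * device driver, which typically migrates the page back
                     * to regular memory so the CPU access can proceed. */
                    return do_device_fault(vma, address, entry);

            /* Otherwise: the existing swap-in path. */
            return dummy_normal_swapin(vma, address, entry);
    }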

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
  2016-11-23  9:16 ` Haggai Eran
@ 2016-11-25 16:16   ` Jerome Glisse
  2016-11-27 13:27     ` Haggai Eran
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-25 16:16 UTC (permalink / raw)
  To: Haggai Eran
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Feras Daoud,
	Ilya Lesokhin, Liran Liss

On Wed, Nov 23, 2016 at 11:16:04AM +0200, Haggai Eran wrote:
> On 11/18/2016 8:18 PM, Jérôme Glisse wrote:
> > Cliff note: HMM offers 2 things (each standing on its own). First
> > it allows to use device memory transparently inside any process
> > without any modifications to process program code. Second it allows
> > to mirror process address space on a device.
> > 
> > Change since v12 is the use of struct page for device memory even if
> > the device memory is not accessible by the CPU (because of limitation
> > impose by the bus between the CPU and the device).
> > 
> > Using struct page means that their are minimal changes to core mm
> > code. HMM build on top of ZONE_DEVICE to provide struct page, it
> > adds new features to ZONE_DEVICE. The first 7 patches implement
> > those changes.
> > 
> > Rest of patchset is divided into 3 features that can each be use
> > independently from one another. First is the process address space
> > mirroring (patch 9 to 13), this allow to snapshot CPU page table
> > and to keep the device page table synchronize with the CPU one.
> > 
> > Second is a new memory migration helper which allow migration of
> > a range of virtual address of a process. This memory migration
> > also allow device to use their own DMA engine to perform the copy
> > between the source memory and destination memory. This can be
> > usefull even outside HMM context in many usecase.
> > 
> > Third part of the patchset (patch 17-18) is a set of helper to
> > register a ZONE_DEVICE node and manage it. It is meant as a
> > convenient helper so that device drivers do not each have to
> > reimplement over and over the same boiler plate code.
> > 
> > 
> > I am hoping that this can now be consider for inclusion upstream.
> > Bottom line is that without HMM we can not support some of the new
> > hardware features on x86 PCIE. I do believe we need some solution
> > to support those features or we won't be able to use such hardware
> > in standard like C++17, OpenCL 3.0 and others.
> > 
> > I have been working with NVidia to bring up this feature on their
> > Pascal GPU. There are real hardware that you can buy today that
> > could benefit from HMM. We also intend to leverage this inside the
> > open source nouveau driver.
> 
> 
> Hi,
> 
> I think the way this new version of the patchset uses ZONE_DEVICE looks
> promising and makes the patchset a little simpler than the previous
> versions.
> 
> The mirroring code seems like it could be used to simplify the on-demand
> paging code in the mlx5 driver and the RDMA subsystem. It currently uses
> mmu notifiers directly.
> 

Yes, i plan to spawn a patchset to show how to use HMM to replace some of
the ODP code. I am waiting for this patchset to go upstream first before
doing that.

> I'm also curious whether it can be used to allow peer to peer access
> between devices. For instance, if one device calls hmm_vma_get_pfns on a
> process that has unaddressable memory mapped in, with some additional
> help from DMA-API, its driver can convert these pfns to bus addresses
> directed to another device's MMIO region and thus enable peer to peer
> access. Then by handling invalidations through HMM's mirroring callbacks
> it can safely handle cases where the peer migrates the page back to the
> CPU or frees it.

Yes, this is something i have worked on with NVidia. The idea is that when
you see an hmm_pfn_t with the device flag set you can then retrieve the
struct device from it. The issue now is to figure out how, from that, you can
know that this is a device with which you can interact. I would like a common
and device agnostic solution but i think as a first step you will need to
rely on some back channel communication.

Once you have set up a peer mapping to the GPU memory its lifetime will be
tied to the CPU page table content, ie the CPU page table can be updated
either to remove the page (because of munmap/truncate ...) or because the
page is migrated to some other place. In both cases the device using the peer
mapping must stop using it and refault to update its page table with the new
page where the data is.

The issue in implementing the above lies in the order in which mmu_notifier
callbacks are called. We want to tear down the peer mapping only once we know
that any device using it is gone. If all devices involved use the HMM mirror
API then this can be solved easily. Otherwise it will need some change to
mmu_notifier.

Note that all of the above would rely on changes to the DMA-API to allow
IO-mapping (through the iommu) of PCI bar addresses into a device IOMMU
context. But this is an orthogonal issue.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-23  4:03   ` Anshuman Khandual
@ 2016-11-27 13:10     ` Jerome Glisse
  2016-11-28  2:58       ` Anshuman Khandual
  0 siblings, 1 reply; 73+ messages in thread
From: Jerome Glisse @ 2016-11-27 13:10 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On Wed, Nov 23, 2016 at 09:33:35AM +0530, Anshuman Khandual wrote:
> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:

[...]

> > + *
> > + *      hmm_vma_migrate(vma, start, end, ops);
> > + *
> > + * With ops struct providing 2 callback alloc_and_copy() which allocated the
> > + * destination memory and initialize it using source memory. Migration can fail
> > + * after this step and thus last callback finalize_and_map() allow the device
> > + * driver to know which page were successfully migrated and which were not.
> 
> So we have page->pgmap->free_devpage() to release the individual page back
> into the device driver management during migration and also we have this ops
> based finalize_and_mmap() to check on the failed instances inside a single
> migration context which can contain set of pages at a time.
> 
> > + *
> > + * This can easily be use outside of HMM intended use case.
> 
> Where you think this can be used outside of HMM ?

Well, on the radar is the new memory hierarchy that seems to be on every CPU
designer's roadmap, where you have a fast, small, HBM-like memory packaged
with the CPU and then you have the regular memory.

In the embedded world they want to migrate active processes to the fast CPU
memory and shut down the regular memory to save power.

In the HPC world they want to migrate hot data of hot processes to this fast
memory.

In both cases we are talking about process based memory migration, and in the
embedded case they also have a DMA engine they can use to offload the copy
operation itself.

These are the useful cases i have in mind but other people might see that
code and realise they could also use it for their own specific corner cases.

[...]

> > +/*
> > + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
> > + *
> > + * Flags:
> > + * HMM_PFN_VALID: pfn is valid
> > + * HMM_PFN_WRITE: CPU page table have the write permission set
> > + */
> > +typedef unsigned long hmm_pfn_t;
> > +
> > +#define HMM_PFN_VALID (1 << 0)
> > +#define HMM_PFN_WRITE (1 << 1)
> > +#define HMM_PFN_SHIFT 2
> > +
> > +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
> > +{
> > +	if (!(pfn & HMM_PFN_VALID))
> > +		return NULL;
> > +	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
> > +}
> > +
> > +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
> > +{
> > +	if (!(pfn & HMM_PFN_VALID))
> > +		return -1UL;
> > +	return (pfn >> HMM_PFN_SHIFT);
> > +}
> > +
> > +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
> > +{
> > +	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> > +}
> > +
> > +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
> > +{
> > +	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
> > +}
> 
> Hmm, so if we use last two bits on PFN as flags, it does reduce the number of
> bits available for the actual PFN range. But given that we support maximum of
> 64TB on POWER (not sure about X86) we can live with this two bits going away
> from the unsigned long. But what is the purpose of tracking validity and write
> flag inside the PFN ?

So 2^46, and with a 12-bit PAGE_SHIFT we only need 34 bits for the pfn value,
hence i should have enough room for my flags. Or is unsigned long not 64 bits
on powerpc ?
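
Working that out explicitly (assuming a 64-bit unsigned long, which is
confirmed below):

    max physical address space : 2^46 bytes (64TB)
    PAGE_SHIFT = 12 (4K pages)  -> max pfn needs 46 - 12 = 34 bits
    PAGE_SHIFT = 16 (64K pages) -> max pfn needs 46 - 16 = 30 bits
    hmm_pfn_t                   : 64 bits, 2 used by HMM_PFN_VALID/HMM_PFN_WRITE
                                  -> 62 bits left for the pfn, far more than needed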

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13
  2016-11-25 16:16   ` Jerome Glisse
@ 2016-11-27 13:27     ` Haggai Eran
  0 siblings, 0 replies; 73+ messages in thread
From: Haggai Eran @ 2016-11-27 13:27 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Feras Daoud,
	Ilya Lesokhin, Liran Liss

On 11/25/2016 6:16 PM, Jerome Glisse wrote:
> Yes this is something i have work on with NVidia, idea is that you will
> see the hmm_pfn_t with the device flag set you can then retrive the struct
> device from it. Issue is now to figure out how from that you can know that
> this is a device with which you can interact. I would like a common and
> device agnostic solution but i think as first step you will need to rely
> on some back channel communication.
Maybe this can be done with the same DMA-API changes you mention below.
Given two device structs (the peer doing the mapping and the device that
provided the pages) and some (unaddressable) ZONE_DEVICE page structs,
ask the DMA-API to provide bus addresses for that p2p transaction.

> Once you have setup a peer mapping to the GPU memory its lifetime will be
> tie with CPU page table content ie if the CPU page table is updated either
> to remove the page (because of munmap/truncate ...) or because the page
> is migrated to some other place. In both case the device using the peer
> mapping must stop using it and refault to update its page table with the
> new page where the data is.
Sounds good.

> Issue to implement the above lie in the order in which mmu_notifier call-
> back are call. We want to tear down the peer mapping only once we know
> that any device using it is gone. If all device involve use the HMM mirror
> API then this can be solve easily. Otherwise it will need some change to
> mmu_notifier.
I'm not sure I understand how p2p would work this way. If the device
that provides the memory is using HMM for migration it marks the CPU
page tables with the special swap entry. Another device that is not
using HMM mirroring won't be able to translate this into a pfn, even if
it uses mmu notifiers.

> Note that all of the above would rely on change to DMA-API to allow to
> IOMMAP (through iommu) PCI bar address into a device IOMMU context. But
> this is an orthogonal issue.

Even without an IOMMU, I think the DMA-API is a good place to tell
whether p2p is at all possible, or whether it is a good idea in terms of
performance.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-27 13:10     ` Jerome Glisse
@ 2016-11-28  2:58       ` Anshuman Khandual
  2016-11-28  9:41         ` Jerome Glisse
  0 siblings, 1 reply; 73+ messages in thread
From: Anshuman Khandual @ 2016-11-28  2:58 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

On 11/27/2016 06:40 PM, Jerome Glisse wrote:
> On Wed, Nov 23, 2016 at 09:33:35AM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> 
> [...]
> 
>>> + *
>>> + *      hmm_vma_migrate(vma, start, end, ops);
>>> + *
>>> + * With ops struct providing 2 callback alloc_and_copy() which allocated the
>>> + * destination memory and initialize it using source memory. Migration can fail
>>> + * after this step and thus last callback finalize_and_map() allow the device
>>> + * driver to know which page were successfully migrated and which were not.
>>
>> So we have page->pgmap->free_devpage() to release the individual page back
>> into the device driver management during migration and also we have this ops
>> based finalize_and_mmap() to check on the failed instances inside a single
>> migration context which can contain set of pages at a time.
>>
>>> + *
>>> + * This can easily be use outside of HMM intended use case.
>>
>> Where you think this can be used outside of HMM ?
> 
> Well on the radar is new memory hierarchy that seems to be on every CPU designer
> roadmap. Where you have a fast small HBM like memory package with the CPU and then
> you have the regular memory.
> 
> In the embedded world they want to migrate active process to fast CPU memory and
> shutdown the regular memory to save power.
> 
> In the HPC world they want to migrate hot data of hot process to this fast memory.
> 
> In both case we are talking about process base memory migration and in case of
> embedded they also have DMA engine they can use to offload the copy operation
> itself.
> 
> This are the useful case i have in mind but other people might see that code and
> realise they could also use it for their own specific corner case.

If there are plans for HBM or a specialized type of memory which will be
packaged inside the CPU (without any other device accessing it, unlike the
case of a GPU or network card), then I think using HMM in that case is not
ideal. The CPU will be the only thing accessing this memory and there is
never going to be any other device or context which can access it outside of
the CPU. Hence the role of a device driver is redundant; it should be
initialized and used as a basic platform component.

In that case what we need is core VM managed memory with certain kinds of
restrictions around allocation and a way of explicitly allocating into it if
required. Representing this memory as a CPU-less, restrictive, coherent
device memory node is a better solution IMHO. The RFCs I have posted
regarding CDM representation are efforts in this direction.

[RFC Specialized Zonelists]    https://lkml.org/lkml/2016/10/24/19
[RFC Restrictive mems_allowed] https://lkml.org/lkml/2016/11/22/339

I believe both HMM and CDM have their own use cases and will complement
each other.

> 
> [...]
> 
>>> +/*
>>> + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
>>> + *
>>> + * Flags:
>>> + * HMM_PFN_VALID: pfn is valid
>>> + * HMM_PFN_WRITE: CPU page table have the write permission set
>>> + */
>>> +typedef unsigned long hmm_pfn_t;
>>> +
>>> +#define HMM_PFN_VALID (1 << 0)
>>> +#define HMM_PFN_WRITE (1 << 1)
>>> +#define HMM_PFN_SHIFT 2
>>> +
>>> +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
>>> +{
>>> +	if (!(pfn & HMM_PFN_VALID))
>>> +		return NULL;
>>> +	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
>>> +}
>>> +
>>> +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
>>> +{
>>> +	if (!(pfn & HMM_PFN_VALID))
>>> +		return -1UL;
>>> +	return (pfn >> HMM_PFN_SHIFT);
>>> +}
>>> +
>>> +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
>>> +{
>>> +	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
>>> +}
>>> +
>>> +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
>>> +{
>>> +	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
>>> +}
>>
>> Hmm, so if we use last two bits on PFN as flags, it does reduce the number of
>> bits available for the actual PFN range. But given that we support maximum of
>> 64TB on POWER (not sure about X86) we can live with this two bits going away
>> from the unsigned long. But what is the purpose of tracking validity and write
>> flag inside the PFN ?
> 
> So 2^46 so with 12bits PAGE_SHIFT we only need 34 bits for pfns value hence i
> should have enough place for my flag or is unsigned long not 64bits on powerpc ?

Yeah, it is 64 bits on POWER; we use a PAGE_SHIFT of 12 for 4K pages and a
PAGE_SHIFT of 16 for 64K pages.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)
  2016-11-28  2:58       ` Anshuman Khandual
@ 2016-11-28  9:41         ` Jerome Glisse
  0 siblings, 0 replies; 73+ messages in thread
From: Jerome Glisse @ 2016-11-28  9:41 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Jatin Kumar,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

> On 11/27/2016 06:40 PM, Jerome Glisse wrote:
> > On Wed, Nov 23, 2016 at 09:33:35AM +0530, Anshuman Khandual wrote:
> >> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
> > 
> > [...]
> > 
> >>> + *
> >>> + *      hmm_vma_migrate(vma, start, end, ops);
> >>> + *
> >>> + * With ops struct providing 2 callback alloc_and_copy() which allocated
> >>> the
> >>> + * destination memory and initialize it using source memory. Migration
> >>> can fail
> >>> + * after this step and thus last callback finalize_and_map() allow the
> >>> device
> >>> + * driver to know which page were successfully migrated and which were
> >>> not.
> >>
> >> So we have page->pgmap->free_devpage() to release the individual page back
> >> into the device driver management during migration and also we have this
> >> ops
> >> based finalize_and_mmap() to check on the failed instances inside a single
> >> migration context which can contain set of pages at a time.
> >>
> >>> + *
> >>> + * This can easily be use outside of HMM intended use case.
> >>
> >> Where you think this can be used outside of HMM ?
> > 
> > Well on the radar is new memory hierarchy that seems to be on every CPU
> > designer
> > roadmap. Where you have a fast small HBM like memory package with the CPU
> > and then
> > you have the regular memory.
> > 
> > In the embedded world they want to migrate active process to fast CPU
> > memory and
> > shutdown the regular memory to save power.
> > 
> > In the HPC world they want to migrate hot data of hot process to this fast
> > memory.
> > 
> > In both case we are talking about process base memory migration and in case
> > of
> > embedded they also have DMA engine they can use to offload the copy
> > operation
> > itself.
> > 
> > This are the useful case i have in mind but other people might see that
> > code and
> > realise they could also use it for their own specific corner case.
> 
> If there are plans for HBM or specialized type of memory which will be
> packaged inside the CPU (without any other device accessing it like in
> the case of GPU or Network Card), then I think in that case using HMM
> is not ideal. CPU will be the only thing accessing this memory and
> there is never going to be any other device or context which can access
> this outside of CPU. Hence role of a device driver is redundant, it
> should be initialized and used as a basic platform component.

AFAIK no CPU can saturate the bandwidth of this memory and thus it only makes
sense when there is something like a GPU on die. So in my mind this kind of
memory is always used preferably by a GPU but could still be used by the CPU.
In that context you also always have a DMA engine to offload memory copies
from the CPU. I was more selling the HMM migration code in that context :)

 
> In that case what we need is a core VM managed memory with certain kind
> of restrictions around the allocation and a way of explicit allocation
> into it if required. Representing these memory like a cpu less restrictive
> coherent device memory node is a better solution IMHO. These RFCs what I
> have posted regarding CDM representation are efforts in this direction.
> 
> [RFC Specialized Zonelists]    https://lkml.org/lkml/2016/10/24/19
> [RFC Restrictive mems_allowed] https://lkml.org/lkml/2016/11/22/339
> 
> I believe both HMM and CDM have their own use cases and will complement
> each other.

Yes, how this memory is represented is probably better handled by something
like CDM.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2016-11-28  9:41 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-18 18:18 [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 01/18] mm/memory/hotplug: convert device parameter bool to set of flags Jérôme Glisse
2016-11-21  0:44   ` Balbir Singh
2016-11-21  4:53     ` Jerome Glisse
2016-11-21  6:57       ` Anshuman Khandual
2016-11-21 12:19         ` Jerome Glisse
2016-11-21  6:41   ` Anshuman Khandual
2016-11-21 12:27     ` Jerome Glisse
2016-11-22  5:35       ` Anshuman Khandual
2016-11-22 14:08         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 02/18] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
2016-11-21  8:06   ` Anshuman Khandual
2016-11-21 12:33     ` Jerome Glisse
2016-11-22  5:15       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 03/18] mm/ZONE_DEVICE/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
2016-11-21  8:18   ` Anshuman Khandual
2016-11-21 12:50     ` Jerome Glisse
2016-11-22  4:30       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 04/18] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
2016-11-21  1:49   ` Balbir Singh
2016-11-21  4:57     ` Jerome Glisse
2016-11-21  8:26   ` Anshuman Khandual
2016-11-21 12:34     ` Jerome Glisse
2016-11-22  5:02       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 05/18] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
2016-11-21 10:37   ` Anshuman Khandual
2016-11-21 12:39     ` Jerome Glisse
2016-11-22  4:54       ` Anshuman Khandual
2016-11-18 18:18 ` [HMM v13 06/18] mm/ZONE_DEVICE/unaddressable: add special swap for unaddressable Jérôme Glisse
2016-11-21  2:06   ` Balbir Singh
2016-11-21  5:05     ` Jerome Glisse
2016-11-22  2:19       ` Balbir Singh
2016-11-22 13:59         ` Jerome Glisse
2016-11-21 11:10     ` Anshuman Khandual
2016-11-21 10:58   ` Anshuman Khandual
2016-11-21 12:42     ` Jerome Glisse
2016-11-22  4:48       ` Anshuman Khandual
2016-11-24 13:56         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 07/18] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2016-11-21  2:08   ` Balbir Singh
2016-11-21  5:08     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2016-11-21  2:29   ` Balbir Singh
2016-11-21  5:14     ` Jerome Glisse
2016-11-23  4:03   ` Anshuman Khandual
2016-11-27 13:10     ` Jerome Glisse
2016-11-28  2:58       ` Anshuman Khandual
2016-11-28  9:41         ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 09/18] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2016-11-21  2:42   ` Balbir Singh
2016-11-21  5:18     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 10/18] mm/hmm/mirror: add range lock helper, prevent CPU page table update for the range Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 11/18] mm/hmm/mirror: add range monitor helper, to monitor CPU page table update Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 12/18] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 13/18] mm/hmm/mirror: device page fault handler Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 14/18] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 15/18] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 16/18] mm/hmm/migrate: new memory migration helper for use with device memory Jérôme Glisse
2016-11-18 19:57   ` Aneesh Kumar K.V
2016-11-18 20:15     ` Jerome Glisse
2016-11-19 14:32   ` Aneesh Kumar K.V
2016-11-19 17:17     ` Jerome Glisse
2016-11-20 18:21       ` Aneesh Kumar K.V
2016-11-20 20:06         ` Jerome Glisse
2016-11-21  3:30   ` Balbir Singh
2016-11-21  5:31     ` Jerome Glisse
2016-11-18 18:18 ` [HMM v13 17/18] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
2016-11-18 18:18 ` [HMM v13 18/18] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
2016-11-19  0:41 ` [HMM v13 00/18] HMM (Heterogeneous Memory Management) v13 John Hubbard
2016-11-19 14:50   ` Aneesh Kumar K.V
2016-11-23  9:16 ` Haggai Eran
2016-11-25 16:16   ` Jerome Glisse
2016-11-27 13:27     ` Haggai Eran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).