Re: [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2017-01-06 16:46 ` [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
@ 2017-01-06 16:46   ` David Nellans
  2017-01-06 17:13     ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: David Nellans @ 2017-01-06 16:46 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Evgeny Baskakov, Mark Hairgrove, Sherry Cheung,
	Subhash Gutti, Cameron Buschardt, Zi Yan, Anshuman Khandual



On 01/06/2017 10:46 AM, Jérôme Glisse wrote:
> This patch add a new memory migration helpers, which migrate memory
> backing a range of virtual address of a process to different memory
> (which can be allocated through special allocator). It differs from
> numa migration by working on a range of virtual address and thus by
> doing migration in chunk that can be large enough to use DMA engine
> or special copy offloading engine.
>
> Expected users are any one with heterogeneous memory where different
> memory have different characteristics (latency, bandwidth, ...). As
> an example IBM platform with CAPI bus can make use of this feature
> to migrate between regular memory and CAPI device memory. New CPU
> architecture with a pool of high performance memory not manage as
> cache but presented as regular memory (while being faster and with
> lower latency than DDR) will also be prime user of this patch.
Why should the normal page migration path (where neither src nor dest are
device private), use the hmm_migrate functionality?  11-14 are
replicating a lot of the
normal migration functionality but with special casing for HMM
requirements.  When migrating
THP's or a list of pages (your use case above), normal NUMA migration
is going to want to do this as fast as possible too (see Zi Yan's
patches for multi-threading normal
migrations & prototype of using intel IOAT for transfers, he sees 3-5x
speedup).

If the intention is to provide a common interface hook for migration to
use DMA acceleration
(which is a good idea), it probably shouldn't be special cased inside
HMM functionality.
For example, using the intel IOAT for migration DMA has nothing to do
with HMM
whatsoever. We need a normal migration path interface to allow DMA that
isn't tied
to HMM.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15
@ 2017-01-06 16:46 Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
                   ` (16 more replies)
  0 siblings, 17 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Cliff note: HMM offers 2 things (each standing on its own). First
it allows to use device memory transparently inside any process
without any modifications to process program code. Second it allows
to mirror process address space on a device.

Change since v14 are simple:
  - improved comments
  - added debug message when memory hotplug fails because of device
    memory

Device memory pick some random unuse physical address range. Latter
memory hotplug might fails because of this. Intention is to fix this
in latter patchset to use physical address above the platform limit
thus making sure that no real memory can be hotplug at conflicting
address.

I think it is ready for next or at least i would like to know any
reasons to not accept this patchset.

Patchset is divided into 3 features that can each be use independently
from one another. First is changes to ZONE_DEVICE so we can have struct
page for device un-addressable memory (patch 2-6). Second is process
address space mirroring (patch 8 to 10), this allow to snapshot CPU
page table and to keep the device page table synchronize with the CPU
one.

Last is a new page migration helper which allow migration for range of
virtual address using hardware copy engine (patch 11-14).

Other patches just introduce common definitions or add safety net to
catch wrong use of some of the features.


Andrew do you want anyone specific to review any specific part of the
patchset before considering it for inclusion ? At this point i want
to know if there is ever a chance of getting this upstream or do we
decide that we don't want to support this kind of hardware ?


In this patchset i restricted myself to set of core features what
is missing:
  - force read only on CPU for memory duplication and GPU atomic
  - changes to mmu_notifier for optimization purposes
  - migration of file back page to device memory

I plan to submit a couple more patchset to implement those features
once core HMM is upstream.

Git tree:
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v15


Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/
    v15 https://lkml.org/lkml/2016/12/8/344

Cheers,
Jérôme

Jérôme Glisse (16):
  mm/free_hot_cold_page: catch ZONE_DEVICE pages
  mm/memory/hotplug: convert device bool to int to allow for more flags
    v2
  mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device
    memory
  mm/ZONE_DEVICE/free-page: callback when page is freed
  mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
    memory v2
  mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  mm/hmm: heterogeneous memory management (HMM for short)
  mm/hmm/mirror: mirror process address space on device with HMM helpers
  mm/hmm/mirror: helper to snapshot CPU page table
  mm/hmm/mirror: device page fault handler
  mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  mm/hmm/migrate: new memory migration helper for use with device memory
    v2
  mm/hmm/migrate: optimize page map once in vma being migrated
  mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory

 MAINTAINERS                                |    7 +
 arch/ia64/mm/init.c                        |   23 +-
 arch/powerpc/mm/mem.c                      |   22 +-
 arch/s390/mm/init.c                        |   10 +-
 arch/sh/mm/init.c                          |   22 +-
 arch/tile/mm/init.c                        |   10 +-
 arch/x86/mm/init_32.c                      |   23 +-
 arch/x86/mm/init_64.c                      |   41 +-
 drivers/dax/pmem.c                         |    3 +-
 drivers/nvdimm/pmem.c                      |    7 +-
 drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
 fs/aio.c                                   |    7 +-
 fs/btrfs/disk-io.c                         |   11 +-
 fs/hugetlbfs/inode.c                       |    9 +-
 fs/nfs/internal.h                          |    5 +-
 fs/nfs/write.c                             |    9 +-
 fs/proc/task_mmu.c                         |   10 +-
 fs/ubifs/file.c                            |    8 +-
 include/linux/balloon_compaction.h         |    3 +-
 include/linux/fs.h                         |   13 +-
 include/linux/hmm.h                        |  525 ++++++++++++++
 include/linux/ioport.h                     |    1 +
 include/linux/memory_hotplug.h             |   31 +-
 include/linux/memremap.h                   |   60 +-
 include/linux/migrate.h                    |    7 +-
 include/linux/mm_types.h                   |    5 +
 include/linux/swap.h                       |   18 +-
 include/linux/swapops.h                    |   67 ++
 kernel/fork.c                              |    2 +
 kernel/memremap.c                          |   69 +-
 mm/Kconfig                                 |   51 ++
 mm/Makefile                                |    1 +
 mm/balloon_compaction.c                    |    2 +-
 mm/hmm.c                                   | 1083 ++++++++++++++++++++++++++++
 mm/memory.c                                |   64 +-
 mm/memory_hotplug.c                        |   14 +-
 mm/migrate.c                               |  687 +++++++++++++++++-
 mm/mprotect.c                              |   12 +
 mm/page_alloc.c                            |   10 +
 mm/rmap.c                                  |   47 ++
 mm/zsmalloc.c                              |   12 +-
 tools/testing/nvdimm/test/iomap.c          |    3 +-
 42 files changed, 2935 insertions(+), 87 deletions(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.4.3

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-09  9:19   ` Balbir Singh
  2017-01-06 16:46 ` [HMM v15 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

Catch page from ZONE_DEVICE in free_hot_cold_page(). This should never
happen as ZONE_DEVICE page must always have an elevated refcount.

This is safety-net to catch any refcounting issues in a sane way for any
ZONE_DEVICE pages.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c24112..355beb4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2445,6 +2445,16 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	/*
+	 * This should never happen ! Page from ZONE_DEVICE always must have an
+	 * active refcount. Complain about it and try to restore the refcount.
+	 */
+	if (is_zone_device_page(page)) {
+		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
+		page_ref_inc(page);
+		return;
+	}
+
 	if (!free_pcp_prepare(page))
 		return;
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

When hotpluging memory we want more informations on the type of memory and
its properties. Replace the device boolean flag by an int and define a set
of flags.

New property for device memory is an opt-in flag to allow page migration
from and to a ZONE_DEVICE. Existing user of ZONE_DEVICE are not expecting
page migration to work for their pages. New changes to page migration i
changing that and we now need a flag to explicitly opt-in page migration.

Changes since v1:
  - Improved commit message
  - Improved define name
  - Improved comments
  - Typos

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/ia64/mm/init.c            | 23 ++++++++++++++++++++---
 arch/powerpc/mm/mem.c          | 22 +++++++++++++++++++---
 arch/s390/mm/init.c            | 10 ++++++++--
 arch/sh/mm/init.c              | 22 +++++++++++++++++++---
 arch/tile/mm/init.c            | 10 ++++++++--
 arch/x86/mm/init_32.c          | 23 ++++++++++++++++++++---
 arch/x86/mm/init_64.c          | 23 ++++++++++++++++++++---
 include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
 include/linux/memremap.h       | 11 +++++++++++
 kernel/memremap.c              |  4 ++--
 mm/memory_hotplug.c            |  4 ++--
 11 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1841ef6..303027e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,18 +645,27 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 
 	if (ret)
@@ -667,13 +676,21 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..6e877d3 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,14 +126,22 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int rc;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
@@ -147,19 +155,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 
 	/* this should work for most non-highmem platforms */
 	zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, 0, for_device);
+		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index b3e9d18..e94b9e1 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	unsigned long zone_start_pfn, zone_end_pfn, nr_pages;
 	unsigned long start_pfn = PFN_DOWN(start);
@@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	struct zone *zone;
 	int rc, i;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
@@ -201,7 +207,7 @@ unsigned long memory_block_size_bytes(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..0ca69ac 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,19 +485,27 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
 	ret = __add_pages(nid, pgdat->node_zones +
 			zone_for_memory(nid, start, size, ZONE_NORMAL,
-			for_device),
+					flags & MEMORY_DEVICE),
 			start_pfn, nr_pages);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
@@ -516,13 +524,21 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (unlikely(ret))
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index adce254..ba001b1 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,13 +863,19 @@ void __init mem_init(void)
  * memory to the highmem for now.
  */
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
+int arch_add_memory(u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata = &contig_page_data;
 	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(zone, start_pfn, nr_pages);
 }
 
@@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/* TODO */
 	return -EBUSY;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cf80590..8287a4b 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,24 +816,41 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata = NODE_DATA(nid);
 	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
+		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	return __remove_pages(zone, start_pfn, nr_pages);
 }
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 14b9dd7..442ac86 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -651,15 +651,24 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -956,8 +965,10 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -965,6 +976,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	struct zone *zone;
 	int ret;
 
+	/* Each flag need special handling so error out on un-supported flag */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 01033fa..3f50eb8 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,7 +103,7 @@ extern bool memhp_auto_online;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, int flags);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -275,7 +275,27 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+
+/*
+ * When hotpluging memory with arch_add_memory() we want more informations on
+ * the type of memory and its properties. The flags parameter allow to provide
+ * more informations on the memory which is being addedd.
+ *
+ * Provide an opt-in flag for struct page migration. Persistent device memory
+ * never relied on struct page migration so far and new user of might also
+ * prefer avoiding struct page migration.
+ *
+ * New non device memory specific flags can be added if ever needed.
+ *
+ * MEMORY_REGULAR: regular system memory
+ * DEVICE_MEMORY: device memory create a ZONE_DEVICE zone for it
+ * DEVICE_MEMORY_ALLOW_MIGRATE: page in that device memory ca be migrated
+ */
+#define MEMORY_NORMAL 0
+#define MEMORY_DEVICE (1 << 0)
+#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+
+extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..f7e0609 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,6 +53,12 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -71,6 +77,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..07665eb 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
 	if (error)
 		goto err_add_memory;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e43142c1..096c651 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1372,7 +1372,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
 
 	if (ret < 0)
 		goto error;
@@ -2156,7 +2156,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_NORMAL);
 
 	try_offline_node(nid);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 17:58   ` Dan Williams
  2017-01-06 16:46 ` [HMM v15 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

Some device driver manage multiple physical devices memory from a single
fake device driver. In that case the fake device might outlive the real
device and ZONE_DEVICE and its resource allocated for a real device would
waste resources in the meantime.

This patch allow early removal of ZONE_DEVICE and associated resource,
before device driver is tear down.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h |  7 +++++++
 kernel/memremap.c        | 14 ++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f7e0609..32314d2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,6 +53,7 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
 static inline bool dev_page_allow_migrate(const struct page *page)
 {
@@ -78,6 +79,12 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
+static inline int devm_memremap_pages_remove(struct device *dev,
+					     struct dev_pagemap *pgmap)
+{
+	return -EINVAL;
+}
+
 static inline bool dev_page_allow_migrate(const struct page *page)
 {
 	return false;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07665eb..250ef25 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -387,6 +387,20 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 }
 EXPORT_SYMBOL(devm_memremap_pages);
 
+static int devm_page_map_match(struct device *dev, void *data, void *match_data)
+{
+	struct page_map *page_map = data;
+
+	return &page_map->pgmap == match_data;
+}
+
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap)
+{
+	return devres_release(dev, &devm_memremap_pages_release,
+			      &devm_page_map_match, pgmap);
+}
+EXPORT_SYMBOL(devm_memremap_pages_remove);
+
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
 	/* number of pfns from base where pfn_to_page() is valid */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2 Jérôme Glisse
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

When a ZONE_DEVICE page refcount reach 1 it means it is free and nobody
is holding a reference on it (only device to which the memory belong do).
Add a callback and call it when that happen so device driver can implement
their own free page management.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/dax/pmem.c                |  3 ++-
 drivers/nvdimm/pmem.c             |  5 +++--
 include/linux/memremap.h          | 17 ++++++++++++++---
 kernel/memremap.c                 | 14 +++++++++++++-
 tools/testing/nvdimm/test/iomap.c |  2 +-
 5 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 033f49b3..66af7b1 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -111,7 +111,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
+				   altmap, NULL, NULL);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8755317..f2f1904 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -282,7 +282,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-				altmap);
+					   altmap, NULL, NULL);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -291,7 +291,8 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
-				&q->q_usage_counter, NULL);
+					   &q->q_usage_counter,
+					   NULL, NULL, NULL);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 32314d2..7845f2e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,23 +35,31 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_free: free page callback when page refcount reach 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: privata data pointer for page_free
  */
 struct dev_pagemap {
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  dev_page_free_t page_free,
+			  void *data);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
@@ -62,8 +70,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 }
 #else
 static inline void *devm_memremap_pages(struct device *dev,
-		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+					struct resource *res,
+					struct percpu_ref *ref,
+					struct vmem_altmap *altmap,
+					dev_page_free_t page_free,
+					void *data)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 250ef25..bc1e400 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+	/*
+	 * If refcount is 1 then page is freed and refcount is stable as nobody
+	 * holds a reference on the page.
+	 */
+	if (page->pgmap->page_free && page_count(page) == 1)
+		page->pgmap->page_free(page, page->pgmap->data);
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
@@ -270,6 +276,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @page_free: callback call when page refcount reach 1 ie it is free
+ * @data: privata data pointer for page_free
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -280,7 +288,9 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *    this is not enforced.
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  dev_page_free_t page_free,
+			  void *data)
 {
 	resource_size_t key, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -322,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->page_free = page_free;
+	pgmap->data = data;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index 64cae1a..9992a7c 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -112,7 +112,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
 
 	if (nfit_res)
 		return nfit_res->buf + offset - nfit_res->res.start;
-	return devm_memremap_pages(dev, res, ref, altmap);
+	return devm_memremap_pages(dev, res, ref, altmap, NULL, NULL);
 }
 EXPORT_SYMBOL(__wrap_devm_memremap_pages);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 06/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

This add support for un-addressable device memory. Such memory is hotpluged
only so we can have struct page but we should never map them as such memory
can not be accessed by CPU. For that reason it uses a special swap entry for
CPU page table entry.

This patch implement all the logic from special swap type to handling CPU
page fault through a callback specified in the ZONE_DEVICE pgmap struct.

Architecture that wish to support un-addressable device memory should make
sure to never populate the kernel linar mapping for the physical range.

This feature potentially breaks memory hotplug unless every driver using it
magically predicts the future addresses of where memory will be hotplugged.

Changed since v1:
  - Add unaddressable memory resource descriptor enum
  - Explain why memory hotplug can fail because of un-addressable memory

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/dax/pmem.c                |  4 +--
 drivers/nvdimm/pmem.c             |  6 ++--
 fs/proc/task_mmu.c                | 10 +++++-
 include/linux/ioport.h            |  1 +
 include/linux/memory_hotplug.h    |  7 ++++
 include/linux/memremap.h          | 29 +++++++++++++++--
 include/linux/swap.h              | 18 +++++++++--
 include/linux/swapops.h           | 67 +++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c                 | 43 +++++++++++++++++++++++--
 mm/Kconfig                        | 12 +++++++
 mm/memory.c                       | 64 ++++++++++++++++++++++++++++++++++++-
 mm/memory_hotplug.c               | 10 ++++--
 mm/mprotect.c                     | 12 +++++++
 tools/testing/nvdimm/test/iomap.c |  3 +-
 14 files changed, 269 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 66af7b1..c50b58d 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -111,8 +111,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
-				   altmap, NULL, NULL);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+				   NULL, NULL, NULL, NULL, MEMORY_DEVICE);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f2f1904..8166a56 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -282,7 +282,8 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-					   altmap, NULL, NULL);
+					   altmap, NULL, NULL, NULL,
+					   NULL, MEMORY_DEVICE);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -292,7 +293,8 @@ static int pmem_attach_disk(struct device *dev,
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
 					   &q->q_usage_counter,
-					   NULL, NULL, NULL);
+					   NULL, NULL, NULL, NULL,
+					   NULL, MEMORY_DEVICE);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 958f325..9a6ab71 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -535,8 +535,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
+		} else if (is_migration_entry(swpent)) {
 			page = migration_entry_to_page(swpent);
+		} else if (is_device_entry(swpent)) {
+			page = device_entry_to_page(swpent);
+		}
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -699,6 +702,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1182,6 +1187,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_entry(entry))
+			page = device_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index 6230064..d154a18 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -130,6 +130,7 @@ enum {
 	IORES_DESC_ACPI_NV_STORAGE		= 3,
 	IORES_DESC_PERSISTENT_MEMORY		= 4,
 	IORES_DESC_PERSISTENT_MEMORY_LEGACY	= 5,
+	IORES_DESC_UNADDRESSABLE_MEMORY		= 6,
 };
 
 /* helpers to define resources */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3f50eb8..e7c5dc6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -285,15 +285,22 @@ extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
  * never relied on struct page migration so far and new user of might also
  * prefer avoiding struct page migration.
  *
+ * For device memory (which use ZONE_DEVICE) we want differentiate between CPU
+ * accessible memory (persitent memory, device memory on an architecture with a
+ * system bus that allow transparent access to device memory) and unaddressable
+ * memory (device memory that can not be accessed by CPU directly).
+ *
  * New non device memory specific flags can be added if ever needed.
  *
  * MEMORY_REGULAR: regular system memory
  * DEVICE_MEMORY: device memory create a ZONE_DEVICE zone for it
  * DEVICE_MEMORY_ALLOW_MIGRATE: page in that device memory ca be migrated
+ * MEMORY_DEVICE_UNADDRESSABLE: un-addressable memory (CPU can not access it)
  */
 #define MEMORY_NORMAL 0
 #define MEMORY_DEVICE (1 << 0)
 #define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)
 
 extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7845f2e..a646c47 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,31 +35,42 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				struct page *page,
+				unsigned flags,
+				pmd_t *pmdp);
 typedef void (*dev_page_free_t)(struct page *page, void *data);
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when CPU fault on an un-addressable device page
  * @page_free: free page callback when page refcount reach 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
  * @data: privata data pointer for page_free
+ * @flags: device memory flags (look for MEMORY_DEVICE_* memory_hotplug.h)
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
 	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
 	void *data;
+	int flags;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap,
+			  dev_page_fault_t page_fault,
 			  dev_page_free_t page_free,
-			  void *data);
+			  void *data, int flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
@@ -68,13 +79,22 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 	return ((page_zonenum(page) == ZONE_DEVICE) &&
 		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return ((page_zonenum(page) != ZONE_DEVICE) ||
+		!(page->pgmap->flags & MEMORY_DEVICE_UNADDRESSABLE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 					struct resource *res,
 					struct percpu_ref *ref,
 					struct vmem_altmap *altmap,
+					struct dev_pagemap **ppgmap,
+					dev_page_fault_t page_fault,
 					dev_page_free_t page_free,
-					void *data)
+					void *data,
+					int flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
@@ -100,6 +120,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return true;
+}
 #endif
 
 /**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 09b212d..81b44ea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Un-addressable device memory support
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -410,8 +422,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..0e339f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index bc1e400..3df08f4 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -200,6 +202,21 @@ void put_zone_device_page(struct page *page)
 }
 EXPORT_SYMBOL(put_zone_device_page);
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_entry_to_page(entry);
+
+	BUG_ON(!page->pgmap->page_fault);
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -252,7 +269,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
+	arch_remove_memory(align_start, align_size, pgmap->flags);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -276,8 +293,11 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @ppgmap: pointer set to new page dev_pagemap on success
+ * @page_fault: callback for CPU page fault on un-addressable memory
  * @page_free: callback call when page refcount reach 1 ie it is free
  * @data: privata data pointer for page_free
+ * @flags: device memory flags (look for MEMORY_DEVICE_* memory_hotplug.h)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -289,8 +309,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap,
+			  dev_page_fault_t page_fault,
 			  dev_page_free_t page_free,
-			  void *data)
+			  void *data, int flags)
 {
 	resource_size_t key, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -299,6 +321,17 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	int error, nid, is_ram;
 	unsigned long pfn;
 
+	if (!(flags & MEMORY_DEVICE)) {
+		WARN_ONCE(1, "%s attempted on non device memory\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (altmap && (flags & MEMORY_DEVICE_UNADDRESSABLE)) {
+		WARN_ONCE(1, "%s with altmap for un-addressable "
+			  "device memory\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
 		- align_start;
@@ -332,8 +365,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->page_fault = page_fault;
 	pgmap->page_free = page_free;
 	pgmap->data = data;
+	pgmap->flags = flags;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
@@ -370,7 +405,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
 	if (error)
 		goto err_add_memory;
 
@@ -387,6 +422,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		page->pgmap = pgmap;
 	}
 	devres_add(dev, page_map);
+	if (ppgmap)
+		*ppgmap = pgmap;
 	return __va(res->start);
 
  err_add_memory:
diff --git a/mm/Kconfig b/mm/Kconfig
index e9b7c7e..0c33f46 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -700,6 +700,18 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_UNADDRESSABLE
+	bool "Un-addressable device memory (GPU memory, ...)"
+	depends on ZONE_DEVICE
+
+	help
+	  Allow to create struct page for un-addressable device memory
+	  ie memory that is only accessible by the device (or group of
+	  devices).
+
+	  Having struct page is necessary for process memory migration
+	  to device memory.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index e870322..69bede9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -890,6 +891,25 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_entry(entry)) {
+			page = device_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for un-addressable page as
+			 * they should be consider just like any other page.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			if (is_write_device_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*src_pte))
+					pte = pte_swp_mksoft_dirty(pte);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1179,6 +1199,32 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
+
+		/*
+		 * Un-addressable page must always be check that are not like
+		 * other swap entries and thus should be check no matter what
+		 * details->check_swap_entries value is.
+		 */
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_entry(entry)) {
+			struct page *page = device_entry_to_page(entry);
+
+			if (unlikely(details)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping &&
+				    details->check_mapping != page_rmapping(page))
+					continue;
+			}
+
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+		}
+
 		/* only check swap_entries if explicitly asked for in details */
 		if (unlikely(details && !details->check_swap_entries))
 			continue;
@@ -2550,6 +2596,14 @@ int do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_entry_fault(vma, vmf->address, entry,
+						 vmf->flags, vmf->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
@@ -3518,6 +3572,7 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 static int handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
+	struct page *page;
 
 	if (unlikely(pmd_none(*vmf->pmd))) {
 		/*
@@ -3568,9 +3623,16 @@ static int handle_pte_fault(struct vm_fault *vmf)
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
 
+	/* Catch mapping of un-addressable memory this should never happen */
+	entry = vmf->orig_pte;
+	page = pfn_to_page(pte_pfn(entry));
+	if (!is_addressable_page(page)) {
+		print_bad_pte(vmf->vma, vmf->address, entry, page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
-	entry = vmf->orig_pte;
 	if (unlikely(!pte_same(*vmf->pte, entry)))
 		goto unlock;
 	if (vmf->flags & FAULT_FLAG_WRITE) {
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 096c651..76f5359 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -149,7 +149,7 @@ void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res;
+	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
 	if (!res)
 		return ERR_PTR(-ENOMEM);
@@ -158,7 +158,13 @@ static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	if (request_resource(&iomem_resource, res) < 0) {
+	conflict =  request_resource_conflict(&iomem_resource, res);
+	if (conflict) {
+		if (conflict->desc == IORES_DESC_UNADDRESSABLE_MEMORY) {
+			pr_debug("Device un-addressable memory block "
+				 "memory hotplug at %#010llx !\n",
+				 (unsigned long long)start);
+		}
 		pr_debug("System RAM resource %pR cannot be added\n", res);
 		kfree(res);
 		return ERR_PTR(-EEXIST);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cc2459c..fc3dd08 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -140,6 +140,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_entry(entry)) {
+				pte_t newpte;
+
+				make_device_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index 9992a7c..19902c9 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -112,7 +112,8 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
 
 	if (nfit_res)
 		return nfit_res->buf + offset - nfit_res->res.start;
-	return devm_memremap_pages(dev, res, ref, altmap, NULL, NULL);
+	return devm_memremap_pages(dev, res, ref, altmap, NULL,
+				   NULL, NULL, NULL, MEMORY_DEVICE);
 }
 EXPORT_SYMBOL(__wrap_devm_memremap_pages);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 06/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2 Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

It does not need much, just skip populating kernel linear mapping
for range of un-addressable device memory (it is pick so that there
is no physical memory resource overlapping it). All the logic is in
share mm code.

Only support x86-64 as this feature doesn't make much sense with
constrained virtual address space of 32bits architecture.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/mm/init_64.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 442ac86..6e7f613 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -654,7 +654,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL,
@@ -669,7 +670,17 @@ int arch_add_memory(int nid, u64 start, u64 size, int flags)
 		return -EINVAL;
 	}
 
-	init_memory_mapping(start, start + size);
+	/*
+	 * We get un-addressable memory when some one is adding a ZONE_DEVICE
+	 * to have struct page for a device memory which is not accessible by
+	 * the CPU so it is pointless to have a linear kernel mapping of such
+	 * memory.
+	 *
+	 * Core mm should make sure it never set a pte pointing to such fake
+	 * physical range.
+	 */
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
@@ -968,7 +979,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -989,7 +1001,9 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
 	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-	kernel_physical_mapping_remove(start, start + size);
+
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 07/16] mm/hmm: heterogeneous memory management (HMM for short)
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 06/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

HMM provides 3 separate functionality :
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions to all of
those 3 functionality.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 MAINTAINERS              |   7 +++
 include/linux/hmm.h      | 150 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |   4 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 |  82 ++++++++++++++++++++++++++
 7 files changed, 251 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 606c43e..f119a0c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5762,6 +5762,13 @@ S:	Supported
 F:	drivers/scsi/hisi_sas/
 F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	linux-wireless@vger.kernel.org
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..f00d519
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,150 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * HMM provides 3 separate functionality :
+ *   - Mirroring: synchronize CPU page table and device page table
+ *   - Device memory: allocating struct page for device memory
+ *   - Migration: migrating regular memory to device memory
+ *
+ * Each can be used independently from the others.
+ *
+ *
+ * Mirroring:
+ *
+ * HMM provide helpers to mirror process address space on a device. For this it
+ * provides several helpers to order device page table update in respect to CPU
+ * page table update. Requirement is that for any given virtual address the CPU
+ * and device page table can not point to different physical page. It uses the
+ * mmu_notifier API behind the scene.
+ *
+ * Device memory:
+ *
+ * HMM provides helpers to help leverage device memory either addressable like
+ * regular memory by the CPU or un-addressable at all. In both case the device
+ * memory is associated to dedicated structs page (which are allocated like for
+ * hotplug memory). Device memory management is under the responsibility of the
+ * device driver. HMM only allocate and initialize the struct pages associated
+ * with the device memory by hotpluging a ZONE_DEVICE memory range.
+ *
+ * Allocating struct page for device memory allow to use device memory allmost
+ * like any regular memory. Unlike regular memory it can not be added to the
+ * lru, nor can any memory allocation can use device memory directly. Device
+ * memory will only end up to be use in a process if device driver migrate some
+ * of the process memory from regular memory to device memory.
+ *
+ *
+ * Migration:
+ *
+ * Existing memory migration mechanism (mm/migrate.c) does not allow to use
+ * something else than the CPU to copy from source to destination memory. More
+ * over existing code is not tailor to drive migration from process virtual
+ * address rather than from list of pages. Finaly the migration flow does not
+ * allow for graceful failure at different step of the migration process.
+ *
+ * HMM solves all of the above through simple API :
+ *
+ *      hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private);
+ *
+ * With ops struct providing 2 callback alloc_and_copy() which allocated the
+ * destination memory and initialize it using source memory. Migration can fail
+ * after this step and thus last callback finalize_and_map() allow the device
+ * driver to know which page were successfully migrated and which were not.
+ *
+ * This can easily be use outside of HMM intended use case.
+ *
+ *
+ * This header file contain all the API related to this 3 functionality and
+ * each functions and struct are more thoroughly documented in below comments.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table have the write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value store in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_to_pfn() - return pfn value store in a hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8aced..4effdbf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -516,6 +517,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM need to track few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index cfee5ec..98a297f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -843,6 +844,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 0c33f46..9cdf361e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,10 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config HMM
+	bool
+	depends on MMU
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a..e4d9f48 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..e891fdd
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be use directly by device driver but by other HMM
+ * component. It allocates an HMM struct if mm does not have one and initialize
+ * it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be free once mm_struct goes away
+	 * hence we should always have pre-allocated an new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	/*
+	 * We should not need to lock here as no one should be able to register
+	 * a new HMM while an mm is being destroy. But just to be safe ...
+	 */
+	spin_lock(&mm->page_table_lock);
+	hmm = mm->hmm;
+	mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+	kfree(hmm);
+}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This is a heterogeneous memory management (HMM) process address space
mirroring. In a nutshell this provide an API to mirror process address
space on a device. This boils down to keeping CPU and device page table
synchronize (we assume that both device and CPU are cache coherent like
PCIe device can be).

This patch provide a simple API for device driver to achieve address
space mirroring thus avoiding each device driver to grow its own CPU
page table walker and its own CPU page table synchronization mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 101 ++++++++++++++++++++++++++++
 mm/Kconfig          |  15 +++++
 mm/hmm.c            | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 301 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f00d519..31e2c50 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
@@ -138,6 +139,106 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to use synchronize device page table with CPU page table ?
+ *
+ * Device driver must always synchronize with CPU page table update, for this
+ * they can either directly use mmu_notifier API or they can use the hmm_mirror
+ * API. Device driver can decide to register one mirror per device per process
+ * or just one mirror per process for a group of device. Pattern is :
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *          int ret;
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contain an hmm_mirror struct as one of its field.
+ *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *          // Other device driver specific initialization
+ *      }
+ *
+ * Device driver must not free the struct containing hmm_mirror struct before
+ * calling hmm_mirror_unregister() expected usage is to do that when device
+ * driver is unbinding from an address space.
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          hmm_mirror_unregister(&das->mirror);
+ *          // Other device driver specific cleanup and now das can be free
+ *      }
+ *
+ * Once an hmm_mirror is register for an address space, device driver will get
+ * callback through the update() operation (see hmm_mirror_ops struct).
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* update() - update virtual address range of memory
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update: update's type (turn read only, unmap, ...)
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback is call when the CPU page table is updated, the device
+	 * driver must update device page table accordingly to update's action.
+	 *
+	 * Device driver callback must wait until device have fully updated its
+	 * view for the range. Note we plan to make this asynchronous in later
+	 * patches. So that multiple devices can schedule update to their page
+	 * table and once all device have schedule the update then we wait for
+	 * them to propagate.
+	 */
+	void (*update)(struct hmm_mirror *mirror,
+		       enum hmm_update action,
+		       unsigned long start,
+		       unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * of hmm_mirror struct with HMM. HMM will track list of all mirrors for each
+ * mm_struct (or each process).
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+			       struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 9cdf361e..598c38a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -293,6 +293,21 @@ config HMM
 	bool
 	depends on MMU
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	select HMM
+	select MMU_NOTIFIER
+	help
+	  HMM mirror is a set of helpers to mirror CPU page table into a device
+	  page table. There is two side, first keep both page table synchronize
+	  so that no virtual address can point to different page (but one page
+	  table might lag ie onee might still point to page while the other is
+	  is pointing to nothing).
+
+	  Second side of the equation is replicating CPU page table content for
+	  range of virtual address. This require careful synchronization with
+	  CPU page table update.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index e891fdd..b725c6d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting mirrors list
+ * @mirrors: list of mirrors for this mm
+ * @wait_queue: wait queue
+ * @sequence: we track update to CPU page table with a sequence number
+ * @mmu_notifier: mmu notifier to track update to CPU page table
+ * @notifier_count: number of currently active notifier count
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	atomic_t		sequence;
+	wait_queue_head_t	wait_queue;
+	struct mmu_notifier	mmu_notifier;
+	atomic_t		notifier_count;
 };
 
 /*
@@ -48,6 +61,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 		if (!hmm)
 			return NULL;
+		init_waitqueue_head(&hmm->wait_queue);
+		atomic_set(&hmm->notifier_count, 0);
+		INIT_LIST_HEAD(&hmm->mirrors);
+		atomic_set(&hmm->sequence, 0);
+		hmm->mmu_notifier.ops = NULL;
+		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
 		spin_lock(&mm->page_table_lock);
@@ -80,3 +99,169 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 	kfree(hmm);
 }
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	/*
+	 * Mirror being added or remove is a rare event so list traversal isn't
+	 * protected by a lock, we rely on simple rules. All list modification
+	 * are done using list_add_rcu() and list_del_rcu() under a spinlock to
+	 * protect from concurrent addition or removal but not traversal.
+	 *
+	 * Because hmm_mirror_unregister() wait for all running invalidation to
+	 * complete (and thus all list traversal to finish). None of the mirror
+	 * struct can be freed from under us while traversing the list and thus
+	 * it is safe to dereference their list pointer even if they were just
+	 * remove.
+	 */
+	list_for_each_entry (mirror, &hmm->mirrors, list)
+		mirror->ops->update(mirror, action, start, end);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	/* Reverse order here because we are getting out of invalidation */
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+static int hmm_mirror_do_register(struct hmm_mirror *mirror,
+				  struct mm_struct *mm,
+				  const bool locked)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	/* Register mmu_notifier if not already, use mmap_sem for locking */
+	if (!mirror->hmm->mmu_notifier.ops) {
+		struct hmm *hmm = mirror->hmm;
+
+		if (!locked)
+			down_write(&mm->mmap_sem);
+		if (!hmm->mmu_notifier.ops) {
+			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+				hmm->mmu_notifier.ops = NULL;
+				up_write(&mm->mmap_sem);
+				return -ENOMEM;
+			}
+		}
+		if (!locked)
+			up_write(&mm->mmap_sem);
+	}
+
+	spin_lock(&mirror->hmm->lock);
+	list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
+	spin_unlock(&mirror->hmm->lock);
+
+	return 0;
+}
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space device driver must register an
+ * HMM mirror struct.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, false);
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_register_locked() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * Same as hmm_mirror_register() except that mmap_sem must write locked !
+ */
+int hmm_mirror_register_locked(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, true);
+}
+EXPORT_SYMBOL(hmm_mirror_register_locked);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: new mirror struct to register
+ *
+ * Stop mirroring a process address space and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&mirror->list);
+	spin_unlock(&hmm->lock);
+
+	/*
+	 * Wait for all active notifier so that it is safe to traverse mirror
+	 * list without any lock.
+	 */
+	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 09/16] mm/hmm/mirror: helper to snapshot CPU page table
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This does not use existing page table walker because we want to share
same code for our page fault handler.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  56 +++++++++++-
 mm/hmm.c            | 257 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 31e2c50..b5eafdc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,13 +83,28 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ: read permission set
  * HMM_PFN_WRITE: CPU page table have the write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry point to poisonous memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
+ *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
+ *      device (the entry will never have HMM_PFN_VALID set and the pfn value
+ *      is undefine)
+ * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_DEVICE (1 << 5)
+#define HMM_PFN_SPECIAL (1 << 6)
+#define HMM_PFN_UNADDRESSABLE (1 << 7)
+#define HMM_PFN_SHIFT 8
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -236,6 +251,43 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
 			       struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range lock are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been fill by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot CPU page table call hmm_vma_get_pfns() then take device driver
+ * lock that serialize device page table update and call hmm_vma_range_done()
+ * to check if snapshot is still valid. The device driver page table update
+ * lock must also be use in the HMM mirror update() callback so that CPU page
+ * table invalidation serialize on it.
+ *
+ * YOU MUST CALL hmm_vma_range_dond() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index b725c6d..0ef06df 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,10 +19,15 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
+
 /*
  * struct hmm - HMM per mm struct
  *
@@ -37,6 +42,7 @@
 struct hmm {
 	struct mm_struct	*mm;
 	spinlock_t		lock;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	atomic_t		sequence;
 	wait_queue_head_t	wait_queue;
@@ -66,6 +72,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		INIT_LIST_HEAD(&hmm->mirrors);
 		atomic_set(&hmm->sequence, 0);
 		hmm->mmu_notifier.ops = NULL;
+		INIT_LIST_HEAD(&hmm->ranges);
 		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
@@ -108,6 +115,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	rcu_read_unlock();
 
 	/*
 	 * Mirror being added or remove is a rare event so list traversal isn't
@@ -264,4 +287,238 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_empty(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_EMPTY;
+}
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static void hmm_vma_walk(struct vm_area_struct *vma,
+			 unsigned long start,
+			 unsigned long end,
+			 hmm_pfn_t *pfns)
+{
+	unsigned long addr, next;
+	hmm_pfn_t flag;
+
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	for (addr = start; addr < end; addr = next) {
+		unsigned long i = (addr - start) >> PAGE_SHIFT;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+		pmd_t pmd;
+
+		/*
+		 * We are accessing/faulting for a device from an unknown
+		 * thread that might be foreign to the mm we are faulting
+		 * against so do not call arch_vma_access_permitted() !
+		 */
+
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset(vma->vm_mm, addr);
+		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		if (pud_none(*pudp) || pud_bad(*pudp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd) || pmd_bad(pmd)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+			hmm_pfn_t flags = flag;
+
+			if (pmd_protnone(pmd)) {
+				hmm_pfns_clear(&pfns[i], addr, next);
+				continue;
+			}
+			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
+			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+			continue;
+		}
+
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			pte_t pte = *ptep;
+
+			pfns[i] = 0;
+
+			if (pte_none(pte)) {
+				pfns[i] = HMM_PFN_EMPTY;
+				continue;
+			}
+
+			entry = pte_to_swp_entry(pte);
+			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				continue;
+			}
+
+			if (pte_present(pte)) {
+				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
+				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			*/
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE;
+				pfns[i] |= HMM_PFN_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual address
+ * @vma: virtual memory area containing the virtual address range
+ * @range: use to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @entries: array of hmm_pfn_t provided by caller fill by function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshot the CPU page table for a range of virtual address, snapshot
+ * validity is track by the range struct see hmm_vma_range_done() for further
+ * informations.
+ *
+ * The range struct is initialized and track CPU page table only if function
+ * returns success (0) then you must call hmm_vma_range_done() to stop range
+ * CPU page table update tracking.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARN !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have register a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	hmm_vma_walk(vma, start, end, pfns);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being track
+ * Returns: false if range data have been invalidated, true otherwise
+ *
+ * Range struct is use to track update to CPU page table after call to
+ * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
+ * to data it gots from this function it calls hmm_vma_range_done() which stop
+ * the tracking.
+ *
+ * There is 2 way to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 10/16] mm/hmm/mirror: device page fault handler
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This handle page fault on behalf of device driver, unlike handle_mm_fault()
it does not trigger migration back to system memory for device memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  26 +++++
 mm/hmm.c            | 269 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 267 insertions(+), 28 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b5eafdc..f19c2a0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -288,6 +288,32 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of device driver unlike handle_mm_fault() it will not
+ * migrate any device memory back to system memory. The hmm_pfn_t array will be
+ * updated with fault result and current snapshot of the CPU page table for the
+ * range.
+ *
+ * The mmap_sem must be taken in read mode before entering and it might be drop
+ * by the function if block argument is false, when that happen the function
+ * returns -EAGAIN.
+ *
+ * Return value does not reflect if the fault was successfull for every single
+ * address or not, you need to inspect the hmm_pfn_t array to determine fault
+ * status for that address. Trying to fault inside an invalid vma will result
+ * in -EINVAL.
+ *
+ * See function description in mm/hmm.c for documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 0ef06df..a397d45 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -288,6 +288,15 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+
+static void hmm_pfns_error(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_ERROR;
+}
+
 static void hmm_pfns_empty(hmm_pfn_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
@@ -304,10 +313,43 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
-static void hmm_vma_walk(struct vm_area_struct *vma,
-			 unsigned long start,
-			 unsigned long end,
-			 hmm_pfn_t *pfns)
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	unsigned long npfns = (end - addr) >> PAGE_SHIFT;
+
+	memset(pfns, 0, sizeof(*pfns) * npfns);
+}
+
+static int hmm_vma_do_fault(struct vm_area_struct *vma,
+			    const hmm_pfn_t fault,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn,
+			    bool block)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	int r;
+
+	flags |= block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= (fault & HMM_PFN_WRITE) ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EAGAIN;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int hmm_vma_walk(struct vm_area_struct *vma,
+			const hmm_pfn_t fault,
+			unsigned long start,
+			unsigned long end,
+			hmm_pfn_t *pfns,
+			bool block)
 {
 	unsigned long addr, next;
 	hmm_pfn_t flag;
@@ -321,6 +363,7 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd_t *pmdp;
 		pte_t *ptep;
 		pmd_t pmd;
+		int ret;
 
 		/*
 		 * We are accessing/faulting for a device from an unknown
@@ -331,15 +374,37 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset(vma->vm_mm, addr);
 		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pudp = pud_alloc(vma->vm_mm, pgdp, addr);
+			if (!pudp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pud_addr_end(addr, end);
 		pudp = pud_offset(pgdp, addr);
 		if (pud_none(*pudp) || pud_bad(*pudp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pmdp = pmd_alloc(vma->vm_mm, pudp, addr);
+			if (!pmdp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pmd_addr_end(addr, end);
@@ -347,8 +412,24 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd = pmd_read_atomic(pmdp);
 		barrier();
 		if (pmd_none(pmd) || pmd_bad(pmd)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			/*
+			 * Use pte_alloc() instead of pte_alloc_map, because we
+			 * can't run pte_offset_map on the pmd, if an huge pmd
+			 * could materialize from under us.
+			 */
+			if (unlikely(pte_alloc(vma->vm_mm, pmdp, addr))) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
+			pmd = *pmdp;
 		}
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
@@ -356,10 +437,14 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 
 			if (pmd_protnone(pmd)) {
 				hmm_pfns_clear(&pfns[i], addr, next);
+				if (fault)
+					goto fault;
 				continue;
 			}
 			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
 			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			if ((flags & fault) != fault)
+				goto fault;
 			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
 			continue;
@@ -370,41 +455,63 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t pte = *ptep;
 
-			pfns[i] = 0;
-
 			if (pte_none(pte)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
 				pfns[i] = HMM_PFN_EMPTY;
 				continue;
 			}
 
 			entry = pte_to_swp_entry(pte);
 			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
+				pfns[i] = 0;
 				continue;
 			}
 
 			if (pte_present(pte)) {
 				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
 				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-				continue;
-			}
-
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			*/
-			if (is_device_entry(entry)) {
+			} else if (is_device_entry(entry)) {
+				/* Do not fault device entry */
 				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_entry(entry))
 					pfns[i] |= HMM_PFN_WRITE;
 				pfns[i] |= HMM_PFN_DEVICE;
 				pfns[i] |= HMM_PFN_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry) && fault) {
+				migration_entry_wait(vma->vm_mm, pmdp, addr);
+				/* Start again for current address */
+				next = addr;
+				ptep++;
+				break;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
+			if ((fault & pfns[i]) != fault) {
+				pte_unmap(ptep);
+				goto fault;
+			}
 		}
 		pte_unmap(ptep - 1);
+		continue;
+
+fault:
+		ret = hmm_vma_do_fault(vma, fault, addr, &pfns[i], block);
+		if (ret)
+			return ret;
+		/* Start again for current address */
+		next = addr;
 	}
+
+	return 0;
 }
 
 /*
@@ -463,7 +570,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
-	hmm_vma_walk(vma, start, end, pfns);
+	hmm_vma_walk(vma, 0, start, end, pfns, false);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -474,14 +581,22 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  * @range: range being track
  * Returns: false if range data have been invalidated, true otherwise
  *
- * Range struct is use to track update to CPU page table after call to
- * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
- * to data it gots from this function it calls hmm_vma_range_done() which stop
- * the tracking.
+ * Range struct is use to track update to CPU page table after call to either
+ * hmm_vma_get_pfns() or hmm_vma_fault(). Once device driver is done using or
+ * want to lock update to data it gots from those functions it must call the
+ * hmm_vma_range_done() function which stop tracking CPU page table update.
+ *
+ * Note that device driver must still implement general CPU page table update
+ * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
+ * mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be use while trying to duplicate CPU page table content for a range of
+ * virtual address.
  *
  * There is 2 way to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -492,7 +607,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -521,4 +636,102 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: use to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entry with fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem)
+ * Returns: 0 success, error otherwise (-EAGAIN means mmap_sem have been drop)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_monitor_end()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURN SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARN !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	hmm_pfn_t fault = HMM_PFN_READ | (write ? HMM_PFN_WRITE : 0);
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have register a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	ret = hmm_vma_walk(vma, fault, start, end, pfns, block);
+	if (ret)
+		hmm_vma_range_done(vma, range);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow to unmap and restore special swap entry of un-addressable
ZONE_DEVICE memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/migrate.c | 11 ++++++++++-
 mm/rmap.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 0ed24b1..5de87d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -40,6 +40,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -248,7 +249,15 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		pte = arch_make_huge_pte(pte, vma, new, 0);
 	}
 #endif
-	flush_dcache_page(new);
+
+	if (unlikely(is_zone_device_page(new)) && !is_addressable_page(new)) {
+		entry = make_device_entry(new, pte_write(pte));
+		pte = swp_entry_to_pte(entry);
+		if (pte_swp_soft_dirty(*ptep))
+			pte = pte_mksoft_dirty(pte);
+	} else
+		flush_dcache_page(new);
+
 	set_pte_at(mm, addr, ptep, pte);
 
 	if (PageHuge(new)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 91619fd..c7b0b54 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -61,6 +61,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1454,6 +1455,52 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			goto out;
 	}
 
+	if ((flags & TTU_MIGRATION) && is_zone_device_page(page)) {
+		swp_entry_t entry;
+		pte_t swp_pte;
+		pmd_t *pmdp;
+
+		if (!dev_page_allow_migrate(page))
+			goto out;
+
+		pmdp = mm_find_pmd(mm, address);
+		if (!pmdp)
+			goto out;
+
+		pte = pte_offset_map_lock(mm, pmdp, address, &ptl);
+		if (!pte)
+			goto out;
+
+		pteval = ptep_get_and_clear(mm, address, pte);
+		if (pte_present(pteval) || pte_none(pteval)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		entry = pte_to_swp_entry(pteval);
+		if (!is_device_entry(entry)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		if (device_entry_to_page(entry) != page) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		entry = make_migration_entry(page, 0);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(*pte))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(mm, address, pte, swp_pte);
+		goto discard;
+	}
+
 	pte = page_check_address(page, mm, address, &ptl,
 				 PageTransCompound(page));
 	if (!pte)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow migration without copy in case destination page already have
source page content. This is usefull for HMM migration to device
where we copy page before doing the final migration step.

This feature need carefull audit of filesystem code to make sure
that no one can write to the source page while it is unmapped and
locked. It should be safe for most filesystem but as precaution
return error until support for device migration is added to them.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/staging/lustre/lustre/llite/rw26.c |  8 +++--
 fs/aio.c                                   |  7 +++-
 fs/btrfs/disk-io.c                         | 11 ++++--
 fs/hugetlbfs/inode.c                       |  9 +++--
 fs/nfs/internal.h                          |  5 +--
 fs/nfs/write.c                             |  9 +++--
 fs/ubifs/file.c                            |  8 ++++-
 include/linux/balloon_compaction.h         |  3 +-
 include/linux/fs.h                         | 13 ++++---
 include/linux/migrate.h                    |  7 ++--
 mm/balloon_compaction.c                    |  2 +-
 mm/migrate.c                               | 56 +++++++++++++++++++-----------
 mm/zsmalloc.c                              | 12 ++++++-
 13 files changed, 106 insertions(+), 44 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index 26f3a37..5a225ca 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -43,6 +43,7 @@
 #include <linux/uaccess.h>
 
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
@@ -635,9 +636,12 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
 #ifdef CONFIG_MIGRATION
 static int ll_migratepage(struct address_space *mapping,
 			  struct page *newpage, struct page *page,
-			  enum migrate_mode mode
-		)
+			  enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/* Always fail page migration until we have a proper implementation */
 	return -EIO;
 }
diff --git a/fs/aio.c b/fs/aio.c
index 428484f..30cf06c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
 #include <linux/blkdev.h>
 #include <linux/compat.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
@@ -366,13 +367,17 @@ static const struct file_operations aio_ring_fops = {
 
 #if IS_ENABLED(CONFIG_MIGRATION)
 static int aio_migratepage(struct address_space *mapping, struct page *new,
-			struct page *old, enum migrate_mode mode)
+			   struct page *old, enum migrate_mode mode, bool copy)
 {
 	struct kioctx *ctx;
 	unsigned long flags;
 	pgoff_t idx;
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(new))
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a57f99..6ccd3c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ratelimit.h>
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
@@ -1046,9 +1047,13 @@ static int btree_submit_bio_hook(struct inode *inode, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page,
-			enum migrate_mode mode)
+			     struct page *newpage, struct page *page,
+			     enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * we can't safely write a btree page from here,
 	 * we haven't done the locking hook
@@ -1062,7 +1067,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 4fb7b10..b52dd44 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -35,6 +35,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/uio.h>
 
 #include <asm/uaccess.h>
@@ -842,11 +843,15 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page,
-				enum migrate_mode mode)
+				  struct page *newpage, struct page *page,
+				  enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 80bcc0b..12d9d8d 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -535,8 +535,9 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct pnfs_ds_commit_info *cinfo)
 #endif
 
 #ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, enum migrate_mode);
+extern int nfs_migrate_page(struct address_space *mapping,
+			    struct page *newpage, struct page *page,
+			    enum migrate_mode, bool copy);
 #endif
 
 static inline int
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5321183..d7130a5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
 #include <linux/writeback.h>
 #include <linux/swap.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
@@ -2023,8 +2024,12 @@ int nfs_wb_single_page(struct inode *inode, struct page *page, bool launder)
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		     struct page *page, enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
 	 * an in-progress read or write request. Don't try to migrate it.
@@ -2039,7 +2044,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 	if (!nfs_fscache_release_page(page, GFP_KERNEL))
 		return -EBUSY;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index b4fbeef..f625cac 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -53,6 +53,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 static int read_block(struct inode *inode, void *addr, unsigned int block,
 		      struct ubifs_data_node *dn)
@@ -1455,10 +1456,15 @@ static int ubifs_set_page_dirty(struct page *page)
 
 #ifdef CONFIG_MIGRATION
 static int ubifs_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..27cf3e3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -85,7 +85,8 @@ extern bool balloon_page_isolate(struct page *page,
 extern void balloon_page_putback(struct page *page);
 extern int balloon_page_migrate(struct address_space *mapping,
 				struct page *newpage,
-				struct page *page, enum migrate_mode mode);
+				struct page *page, enum migrate_mode mode,
+				bool copy);
 
 /*
  * balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f63d44..431f0d3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -398,8 +398,9 @@ struct address_space_operations {
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
 	 */
-	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+	int (*migratepage)(struct address_space *mapping,
+			   struct page *newpage, struct page *page,
+			   enum migrate_mode, bool copy);
 	bool (*isolate_page)(struct page *, isolate_mode_t);
 	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
@@ -3010,9 +3011,11 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
 extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *,
-				enum migrate_mode);
+extern int buffer_migrate_page(struct address_space *mapping,
+			       struct page *newpage,
+			       struct page *page,
+			       enum migrate_mode,
+			       bool copy);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..37b77ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,8 +33,11 @@ extern char *migrate_reason_names[MR_TYPES];
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+extern int migrate_page(struct address_space *mapping,
+			struct page *newpage,
+			struct page *page,
+			enum migrate_mode,
+			bool copy);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
 extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..ed5cacb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -135,7 +135,7 @@ void balloon_page_putback(struct page *page)
 /* move_to_new_page() counterpart for a ballooned page */
 int balloon_page_migrate(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		enum migrate_mode mode, bool copy)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 5de87d5..36e2ed9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,18 +622,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	}
 }
 
-/*
- * Copy the page to its new location
- */
-void migrate_page_copy(struct page *newpage, struct page *page)
+static void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -687,6 +679,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -700,8 +705,8 @@ EXPORT_SYMBOL(migrate_page_copy);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		 struct page *newpage, struct page *page,
+		 enum migrate_mode mode, bool copy)
 {
 	int rc;
 
@@ -712,7 +717,11 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
+
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -724,13 +733,14 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			struct page *newpage, struct page *page,
+			enum migrate_mode mode, bool copy)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, mode);
+		return migrate_page(mapping, newpage, page, mode, copy);
 
 	head = page_buffers(page);
 
@@ -762,12 +772,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -822,7 +835,8 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, enum migrate_mode mode)
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
@@ -839,7 +853,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, true);
 }
 
 /*
@@ -867,7 +881,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	if (likely(is_lru)) {
 		if (!mapping)
-			rc = migrate_page(mapping, newpage, page, mode);
+			rc = migrate_page(mapping, newpage, page, mode, true);
 		else if (mapping->a_ops->migratepage)
 			/*
 			 * Most pages have a mapping and most filesystems
@@ -877,7 +891,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 			 * for page migration.
 			 */
 			rc = mapping->a_ops->migratepage(mapping, newpage,
-							page, mode);
+							page, mode, true);
 		else
 			rc = fallback_migrate_page(mapping, newpage,
 							page, mode);
@@ -894,7 +908,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		}
 
 		rc = mapping->a_ops->migratepage(mapping, newpage,
-						page, mode);
+						page, mode, true);
 		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
 			!PageIsolated(page));
 	}
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b0bc023..bf73222 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -52,6 +52,7 @@
 #include <linux/zpool.h>
 #include <linux/mount.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/pagemap.h>
 
 #define ZSPAGE_MAGIC	0x58
@@ -2015,7 +2016,7 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 }
 
 int zs_page_migrate(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		    struct page *page, enum migrate_mode mode, bool copy)
 {
 	struct zs_pool *pool;
 	struct size_class *class;
@@ -2033,6 +2034,15 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
+	/*
+	 * Offloading copy operation for zspage require special considerations
+	 * due to locking so for now we only support regular migration. I do
+	 * not expect we will ever want to support offloading copy. See hmm.h
+	 * for more informations on hmm_vma_migrate() and offload copy.
+	 */
+	if (!copy || !is_addressable_page(newpage))
+		return -EINVAL;
+
 	zspage = get_zspage(page);
 
 	/* Concurrent compactor cannot migrate any subpage in zspage */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46   ` David Nellans
  2017-01-06 16:46 ` [HMM v15 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This patch add a new memory migration helpers, which migrate memory
backing a range of virtual address of a process to different memory
(which can be allocated through special allocator). It differs from
numa migration by working on a range of virtual address and thus by
doing migration in chunk that can be large enough to use DMA engine
or special copy offloading engine.

Expected users are any one with heterogeneous memory where different
memory have different characteristics (latency, bandwidth, ...). As
an example IBM platform with CAPI bus can make use of this feature
to migrate between regular memory and CAPI device memory. New CPU
architecture with a pool of high performance memory not manage as
cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime user of this patch.

Migration to private device memory will be usefull for device that
have large pool of such like GPU, NVidia plans to use HMM for that.

Changed since v1:
  - typos fix
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  66 +++++++-
 mm/Kconfig          |  13 ++
 mm/migrate.c        | 460 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 536 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f19c2a0..b1de4e1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,10 +88,13 @@ struct hmm;
  * HMM_PFN_ERROR: corresponding CPU page table entry point to poisonous memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
  * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_LOCKED: underlying struct page is lock
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
  *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
  *      device (the entry will never have HMM_PFN_VALID set and the pfn value
  *      is undefine)
+ * HMM_PFN_MIGRATE: use by hmm_vma_migrate() to signify which address can be
+ *      migrated
  * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
@@ -102,9 +105,11 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_ERROR (1 << 3)
 #define HMM_PFN_EMPTY (1 << 4)
 #define HMM_PFN_DEVICE (1 << 5)
-#define HMM_PFN_SPECIAL (1 << 6)
-#define HMM_PFN_UNADDRESSABLE (1 << 7)
-#define HMM_PFN_SHIFT 8
+#define HMM_PFN_LOCKED (1 << 6)
+#define HMM_PFN_SPECIAL (1 << 7)
+#define HMM_PFN_MIGRATE (1 << 8)
+#define HMM_PFN_UNADDRESSABLE (1 << 9)
+#define HMM_PFN_SHIFT 10
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -317,6 +322,61 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_MIGRATE)
+/*
+ * struct hmm_migrate_ops - migrate operation callback
+ *
+ * @alloc_and_copy: alloc destination memoiry and copy source to it
+ * @finalize_and_map: allow caller to inspect successfull migrated page
+ *
+ * The new HMM migrate helper hmm_vma_migrate() allow memory migration to use
+ * device DMA engine to perform copy from source to destination memory it also
+ * allow caller to use its own memory allocator for destination memory.
+ *
+ * Note that in alloc_and_copy device driver can decide not to migrate some of
+ * the entry by simply setting corresponding dst_pfns to 0.
+ *
+ * Destination page must locked and HMM_PFN_LOCKED flag set in corresponding
+ * hmm_pfn_t entry of dst_pfns array. It is expected that page allocated will
+ * have an elevated refcount and that a put_page() will free the page.
+ *
+ * Device driver might want to allocate with an extra-refcount if they want to
+ * control deallocation of failed migration inside finalize_and_map() callback.
+ *
+ * Inside finalize_and_map() device driver must use the HMM_PFN_MIGRATE flag to
+ * determine which page have been successfully migrated (this is set inside the
+ * src_pfns array).
+ *
+ * For migration from device memory to system memory device driver must set any
+ * dst_pfns entry to HMM_PFN_ERROR for any entry it can not migrate back due to
+ * hardware fatal failure that can not be recovered. Such failure will trigger
+ * a SIGBUS for the process trying to access such memory.
+ */
+struct hmm_migrate_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const hmm_pfn_t *src_pfns,
+			       hmm_pfn_t *dst_pfns,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const hmm_pfn_t *src_pfns,
+				 hmm_pfn_t *dst_pfns,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    hmm_pfn_t *src_pfns,
+		    hmm_pfn_t *dst_pfns,
+		    unsigned long start,
+		    unsigned long end,
+		    void *private);
+#endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 598c38a..3806d69 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -308,6 +308,19 @@ config HMM_MIRROR
 	  range of virtual address. This require careful synchronization with
 	  CPU page table update.
 
+config HMM_MIGRATE
+	bool "HMM migrate virtual range of process using device driver DMA"
+	select HMM
+	select MIGRATION
+	help
+	  HMM migrate is a new helper to migrate range of virtual address using
+	  special page allocator and copy callback. This allow device driver to
+	  migrate range of a process memory to its memory using its DMA engine.
+
+	  It obyes all rules of memory migration, except that it supports the
+	  migration of ZONE_DEVICE page that have MEMOY_DEVICE_ALLOW_MIGRATE
+	  flag set.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 36e2ed9..365b615 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
 #include <linux/memremap.h>
+#include <linux/hmm.h>
 
 #include <asm/tlbflush.h>
 
@@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2087,3 +2096,454 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+#if IS_ENABLED(CONFIG_HMM_MIGRATE)
+struct hmm_migrate {
+	struct vm_area_struct	*vma;
+	hmm_pfn_t		*dst_pfns;
+	hmm_pfn_t		*src_pfns;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int hmm_collect_walk_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_migrate *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	hmm_pfn_t *src_pfns;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+again:
+	if (pmd_none(*pmdp))
+		return 0;
+
+	split_huge_pmd(walk->vma, pmdp, addr);
+	if (pmd_trans_unstable(pmdp))
+		goto again;
+
+	src_pfns = &migrate->src_pfns[(addr - migrate->start) >> PAGE_SHIFT];
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+
+	for (; addr < end; addr += PAGE_SIZE, src_pfns++, ptep++) {
+		unsigned long pfn;
+		swp_entry_t entry;
+		struct page *page;
+		hmm_pfn_t flags;
+		bool write;
+		pte_t pte;
+
+		pte = *ptep;
+
+		if (!pte_present(pte)) {
+			if (pte_none(pte))
+				continue;
+
+			/*
+			 * Only care about un-addressable device page special
+			 * page table entry. Other special swap entry are not
+			 * migratable and we ignore regular swaped page.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_entry(entry))
+				continue;
+
+			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
+			write = is_write_device_entry(entry);
+			page = device_entry_to_page(entry);
+			pfn = page_to_pfn(page);
+
+			if (!dev_page_allow_migrate(page))
+				continue;
+		} else {
+			pfn = pte_pfn(pte);
+			write = pte_write(pte);
+			page = pfn_to_page(pfn);
+			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
+		}
+
+		/* FIXME support THP see hmm_migrate_page_check() */
+		if (PageTransCompound(page))
+			continue;
+
+		/*
+		 * Corner case handling:
+		 * 1. When a new swap-cache page is read into, it is added to
+		 * the LRU and treated as swapcache but it has no rmap yet. Skip
+		 * those.
+		 */
+		if (!page->mapping)
+			continue;
+
+		*src_pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
+		*src_pfns |= write ? HMM_PFN_WRITE : 0;
+		migrate->npages++;
+
+		/*
+		 * By getting a reference on the page we pin it and blocks any
+		 * kind of migration. Side effect is that it "freeze" the pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non device page (device page are not on the lru and thus
+		 * can't be drop from it).
+		 */
+		get_page(page);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * hmm_migrate_collect() - collect page over range of virtual address
+ * @migrate: migrate struct containing all migration informations
+ *
+ * This will go over the CPU page table and for each virtual address back by a
+ * valid page it update the src_pfns array and take a reference on the page in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void hmm_migrate_collect(struct hmm_migrate *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = hmm_collect_walk_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
+}
+
+/*
+ * hmm_migrate_page_check() - check if page is pin or not
+ * @page: struct page to check
+ *
+ * Pinned page can not be migrated. Same test in migrate_page_move_mapping()
+ * except that here we allow migration of ZONE_DEVICE page.
+ */
+static inline bool hmm_migrate_page_check(struct page *page)
+{
+	/*
+	 * One extra ref because caller hold an extra reference either from
+	 * either isolate_lru_page() for regular page or hmm_migrate_collect()
+	 * for device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is bit more complex to
+	 * check them then regular page because they can be map with a pmd or
+	 * with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	/* Page from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		if (!dev_page_allow_migrate(page))
+			return false;
+		extra++;
+	}
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * hmm_migrate_lock_and_isolate() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration informations
+ *
+ * This lock pages that have been collected by hmm_migrate_collect(). Once page
+ * is locked it is isolated from the lru (for non device page). Finaly the ref
+ * taken by hmm_migrate_collect() is drop as locked page can not be migrated by
+ * concurrent kernel thread.
+ */
+static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src_pfns[i] |= HMM_PFN_LOCKED;
+
+		/* ZONE_DEVICE page are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
+
+			if (isolate_lru_page(page)) {
+				migrate->src_pfns[i] = 0;
+				migrate->npages--;
+				unlock_page(page);
+				put_page(page);
+			} else
+				/* Drop the reference we took in collect */
+				put_page(page);
+		}
+
+		if (!hmm_migrate_page_check(page)) {
+			migrate->src_pfns[i] = 0;
+			migrate->npages--;
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+}
+
+/*
+ * hmm_migrate_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration informations
+ *
+ * Replace page mapping (CPU page table pte) with special migration pte entry
+ * and check again if it has be pin. Pin page are restore because we can not
+ * migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy content of original page over to new page.
+ */
+static void hmm_migrate_unmap(struct hmm_migrate *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	unsigned long addr = migrate->start, i = 0, restore = 0;
+
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page || !(migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !hmm_migrate_page_check(page)) {
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+			migrate->npages--;
+			restore++;
+		}
+	}
+
+	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page || (migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src_pfns[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
+	}
+}
+
+/*
+ * hmm_migrate_struct_page() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration informations
+ *
+ * This migrate struct page meta-data from source struct page to destination
+ * struct page. This effectively finish the migration from source page to the
+ * destination page.
+ */
+static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *newpage = hmm_pfn_to_page(migrate->dst_pfns[i]);
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		/*
+		 * For now only support private anonymous when migrating
+		 * to un-addressable device memory.
+		 */
+		if (mapping && is_zone_device_page(newpage) &&
+		    !is_addressable_page(newpage)) {
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+			continue;
+		}
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+	}
+}
+
+/*
+ * hmm_migrate_remove_migration_pte() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration informations
+ *
+ * This replace the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page or to the original page
+ * otherwise.
+ *
+ * This also unlock the page and put them back on the lru or drop the extra
+ * ref for device page.
+ */
+static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+
+	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
+		struct page *newpage = hmm_pfn_to_page(migrate->dst_pfns[i]);
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page)
+			continue;
+		newpage = newpage ? newpage : page;
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->npages--;
+
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @src_pfns: array of hmm_pfn_t containing source pfns
+ * @dst_pfns: array of hmm_pfn_t containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, error code otherwise
+ *
+ * This will try to migrate a range of memory using callback to allocate and
+ * copy memory from source to destination. This function will first collect,
+ * lock and unmap pages in the range and then call alloc_and_copy() callback
+ * for device driver to allocate destination memory and copy from source.
+ *
+ * Then it will proceed and try to effectively migrate the page (struct page
+ * metadata) a step that can fail for various reasons. Before updating CPU page
+ * table it will call finalize_and_map() callback so that device driver can
+ * inspect what have been successfully migrated and update its own page table
+ * (this latter aspect is not mandatory and only make sense for some user of
+ * this API).
+ *
+ * Finaly the function update CPU page table and unlock the pages before
+ * returning 0.
+ *
+ * It will return an error code only if one of the argument is invalid.
+ */
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    hmm_pfn_t *src_pfns,
+		    hmm_pfn_t *dst_pfns,
+		    unsigned long start,
+		    unsigned long end,
+		    void *private)
+{
+	struct hmm_migrate migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (!vma || !ops || !src_pfns || !dst_pfns || start >= end)
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	memset(src_pfns, 0, sizeof(*src_pfns) * ((end - start) >> PAGE_SHIFT));
+	migrate.src_pfns = src_pfns;
+	migrate.dst_pfns = dst_pfns;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	hmm_migrate_collect(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Lock and isolate page */
+	hmm_migrate_lock_and_isolate(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Unmap pages */
+	hmm_migrate_unmap(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/*
+	 * At this point pages are lock and unmap and thus they have stable
+	 * content and can safely be copied to destination memory that is
+	 * allocated by the callback.
+	 *
+	 * Note that migration can fail in hmm_migrate_struct_page() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src_pfns, dst_pfns, start, end, private);
+
+	/* This does the real migration of struct page */
+	hmm_migrate_struct_page(&migrate);
+
+	ops->finalize_and_map(vma, src_pfns, dst_pfns, start, end, private);
+
+	/* Unlock and remap pages */
+	hmm_migrate_remove_migration_pte(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_migrate);
+#endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 14/16] mm/hmm/migrate: optimize page map once in vma being migrated
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

Common case for migration of virtual address range is page are map
only once inside the vma in which migration is taking place. Because
we already walk the CPU page table for that range we can directly do
the unmap there and setup special migration swap entry.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 170 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 365b615..a256f68 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2116,6 +2116,7 @@ static int hmm_collect_walk_pmd(pmd_t *pmdp,
 	struct hmm_migrate *migrate = walk->private;
 	struct mm_struct *mm = walk->vma->vm_mm;
 	unsigned long addr = start;
+	unsigned long unmaped = 0;
 	hmm_pfn_t *src_pfns;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -2130,6 +2131,7 @@ static int hmm_collect_walk_pmd(pmd_t *pmdp,
 
 	src_pfns = &migrate->src_pfns[(addr - migrate->start) >> PAGE_SHIFT];
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 
 	for (; addr < end; addr += PAGE_SIZE, src_pfns++, ptep++) {
 		unsigned long pfn;
@@ -2194,9 +2196,44 @@ static int hmm_collect_walk_pmd(pmd_t *pmdp,
 		 * can't be drop from it).
 		 */
 		get_page(page);
+
+		/*
+		 * Optimize for common case where page is only map once in one
+		 * process. If we can lock the page then we can safely setup
+		 * special migration page table entry now.
+		 */
+		if (!trylock_page(page)) {
+			set_pte_at(mm, addr, ptep, pte);
+		} else {
+			pte_t swp_pte;
+
+			*src_pfns |= HMM_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, write);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like regulat unmap we remove the rmap and
+			 * drop page refcount. Page won't be free as we took
+			 * a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmaped++;
+		}
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmaped)
+		flush_tlb_range(walk->vma, start, end);
+
 	return 0;
 }
 
@@ -2279,18 +2316,26 @@ static inline bool hmm_migrate_page_check(struct page *page)
 static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
 {
 	unsigned long addr = migrate->start, i = 0;
+ 	struct mm_struct *mm = migrate->vma->vm_mm;
+ 	struct vm_area_struct *vma = migrate->vma;
+	hmm_pfn_t *src_pfns = migrate->src_pfns;
+	unsigned long restore = 0;
 	bool allow_drain = true;
 
 	lru_add_drain();
 
 	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
-		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+		struct page *page = hmm_pfn_to_page(src_pfns[i]);
+		bool need_restore = true;
 
 		if (!page)
 			continue;
 
-		lock_page(page);
-		migrate->src_pfns[i] |= HMM_PFN_LOCKED;
+		if (!(src_pfns[i] & HMM_PFN_LOCKED)) {
+			lock_page(page);
+			need_restore = false;
+			src_pfns[i] |= HMM_PFN_LOCKED;
+		}
 
 		/* ZONE_DEVICE page are not on LRU */
 		if (!is_zone_device_page(page)) {
@@ -2301,20 +2346,135 @@ static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
 			}
 
 			if (isolate_lru_page(page)) {
-				migrate->src_pfns[i] = 0;
-				migrate->npages--;
-				unlock_page(page);
-				put_page(page);
+				if (need_restore) {
+					src_pfns[i] &= ~HMM_PFN_MIGRATE;
+					restore++;
+				} else {
+					migrate->npages--;
+					unlock_page(page);
+					src_pfns[i] = 0;
+					put_page(page);
+				}
 			} else
 				/* Drop the reference we took in collect */
 				put_page(page);
 		}
 
 		if (!hmm_migrate_page_check(page)) {
-			migrate->src_pfns[i] = 0;
-			migrate->npages--;
+			if (need_restore) {
+				src_pfns[i] &= ~HMM_PFN_MIGRATE;
+				restore++;
+			} else {
+				migrate->npages--;
+				unlock_page(page);
+				src_pfns[i] = 0;
+				put_page(page);
+			}
+		}
+	}
+
+	if (!restore)
+		return;
+
+	for (addr = migrate->start, i = 0; (addr < migrate->end) && restore;) {
+		struct page *page = hmm_pfn_to_page(src_pfns[i]);
+		unsigned long next, restart;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		if (!page || !(src_pfns[i] & HMM_PFN_MIGRATE)) {
+			addr += PAGE_SIZE;
+			i++;
+			continue;
+		}
+
+		restart = addr;
+
+		/*
+		 * Some one might have zap the mapping. Truncate should be only
+		 * case for which this might happen while holding mmap_sem.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		next = pgd_addr_end(addr, migrate->end);
+		if (!pgdp || pgd_none_or_clear_bad(pgdp))
+			goto unlock_release;
+		pudp = pud_offset(pgdp, addr);
+		next = pud_addr_end(addr, migrate->end);
+		if (!pudp || pud_none(*pudp))
+			goto unlock_release;
+		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(addr, migrate->end);
+		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp))
+			goto unlock_release;
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			bool write;
+			pte_t pte;
+
+			page = hmm_pfn_to_page(src_pfns[i]);
+			if (!page || (src_pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			write = src_pfns[i] & HMM_PFN_WRITE;
+			write &= (vma->vm_flags & VM_WRITE);
+
+			/* Here it means pte must be a valid migration entry */
+			pte = ptep_get_and_clear(mm, addr, ptep);
+			if (pte_none(pte) || pte_present(pte)) {
+				/* SOMETHING BAD IS GOING ON ! */
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+			entry = pte_to_swp_entry(pte);
+			if (!is_migration_entry(entry)) {
+				/* SOMETHING BAD IS GOING ON ! */
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+
+			if (is_zone_device_page(page) &&
+			    !is_addressable_page(page)) {
+				entry = make_device_entry(page, write);
+				pte = swp_entry_to_pte(entry);
+			} else {
+				pte = mk_pte(page, vma->vm_page_prot);
+				pte = pte_mkold(pte);
+				if (write)
+					pte = pte_mkwrite(pte);
+			}
+			if (pte_swp_soft_dirty(*ptep))
+				pte = pte_mksoft_dirty(pte);
+
+			get_page(page);
+			set_pte_at(mm, addr, ptep, pte);
+			if (PageAnon(page))
+				page_add_anon_rmap(page, vma, addr, false);
+			else
+				page_add_file_rmap(page, false);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+unlock_release:
+		addr = restart;
+		i = (addr - migrate->start) >> PAGE_SHIFT;
+		for (; addr < next && restore; addr += PAGE_SHIFT, i++) {
+			page = hmm_pfn_to_page(src_pfns[i]);
+			if (!page || (src_pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			src_pfns[i] = 0;
 			unlock_page(page);
-			put_page(page);
+			restore--;
+
+			if (is_zone_device_page(page))
+				put_page(page);
+			else
+				putback_lru_page(page);
 		}
 	}
 }
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 16:46 ` [HMM v15 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
  2017-01-06 20:54 ` [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Dave Hansen
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduce a simple struct and associated helpers for device driver
to use when hotpluging un-addressable device memory as ZONE_DEVICE. It
will find a unuse physical address range and trigger memory hotplug for
it which allocates and initialize struct page for the device memory.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 116 ++++++++++++++++++++++++
 mm/Kconfig          |   7 ++
 mm/hmm.c            | 251 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 374 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b1de4e1..674aa79 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -377,6 +381,118 @@ int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
 #endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: call when refcount on page reach 1 and thus is no longer use
+ * @fault: call when there is a page fault to unaddressable memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ * @inuse: is struct in use
+ *
+ * This an helper structure for device driver that do not wish to implement
+ * to gory details related to hotpluging new memoy and in allocating struct
+ * pages.
+ *
+ * Device driver can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct dev_pagemap		*pagemap;
+	struct device			*device;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+	bool				inuse;
+};
+
+/*
+ * To add (hotplug) device memory, it assumes that there is no real resource
+ * that reserve a range in the physical address space (this is intended to be
+ * use by un-addressable device memory). It will reserve a physical range big
+ * enough and allocate struct page for it.
+ *
+ * Device driver can wrap the hmm_devmem struct inside a private device driver
+ * struct. Device driver must call hmm_devmem_remove() before device goes away
+ * and before freeing the hmm_devmem struct memory.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size);
+bool hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   hmm_pfn_t *src_pfns,
+			   hmm_pfn_t *dst_pfns,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because page can not be on lru we have an unsigned long that driver can use
+ * to store a per page field. This just a simple helper to do that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 3806d69..0d9cdcf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -321,6 +321,13 @@ config HMM_MIGRATE
 	  migration of ZONE_DEVICE page that have MEMOY_DEVICE_ALLOW_MIGRATE
 	  flag set.
 
+config HMM_DEVMEM
+	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+	select HMM
+	help
+	  HMM devmem are helpers to leverage new ZONE_DEVICE feature. This is
+	  just to avoid device driver to replicate boiler plate code.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index a397d45..a9af1bc 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/mmu_notifier.h>
 
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
 
 /*
  * struct hmm - HMM per mm struct
@@ -735,3 +740,249 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+	devmem->inuse = false;
+}
+
+static void hmm_devmem_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, hmm_devmem_exit, data);
+}
+
+static void hmm_devmem_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	devmem->inuse = false;
+	percpu_ref_kill(ref);
+	devm_remove_action(devmem->device, hmm_devmem_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+/*
+ * hmm_devmem_add() - hotplug fake ZONE_DEVICE memory for device memory
+ *
+ * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
+ * @ops: memory event device driver callback (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource too
+ * @size: size in bytes of the device memory to add
+ * Returns: 0 on success, error code otherwise
+ *
+ * This first find an empty range of physical address big enough to for the new
+ * resource and then hotplug it as ZONE_DEVICE memory allocating struct page.
+ * It does not do anything beside that, all events affecting the memory will go
+ * through the various callback provided by hmm_devmem_ops struct.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size)
+{
+	const struct resource *res;
+	resource_size_t addr;
+	void *ptr;
+	int ret;
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->pagemap = NULL;
+	devmem->inuse = false;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref,&hmm_devmem_release,0,GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	ret = devm_add_action(device, hmm_devmem_exit, &devmem->ref);
+	if (ret)
+		goto error;
+
+	size = ALIGN(size, SECTION_SIZE);
+	addr = (1UL << MAX_PHYSMEM_BITS) - size;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error;
+		}
+		devmem->resource->desc = IORES_DESC_UNADDRESSABLE_MEMORY;
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error;
+	}
+
+	ptr = devm_memremap_pages(device, devmem->resource, &devmem->ref,
+				  NULL, &devmem->pagemap,
+				  hmm_devmem_fault, hmm_devmem_free, devmem,
+				  MEMORY_DEVICE | MEMORY_DEVICE_ALLOW_MIGRATE |
+				  MEMORY_DEVICE_UNADDRESSABLE);
+	if (IS_ERR(ptr)) {
+		ret = PTR_ERR(ptr);
+		goto error;
+	}
+
+	ret = devm_add_action(device, hmm_devmem_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_kill(&devmem->ref);
+		goto error;
+	}
+
+	res = devmem->pagemap->res;
+	devmem->pfn_first = res->start >> PAGE_SHIFT;
+	devmem->pfn_last = (resource_size(res)>>PAGE_SHIFT)+devmem->pfn_first;
+	devmem->inuse = true;
+
+	return 0;
+
+error:
+	hmm_devmem_exit(&devmem->ref);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
+ * Returns: true if device memory is no longer in use, false if still in use
+ *
+ * This will hot remove memory that was hotplug by hmm_devmem_add on behalf of
+ * device driver. It will free struct page and remove the resource that reserve
+ * the physical address range for this device memory.
+ *
+ * Device driver can not free the struct while this function return false, it
+ * must call over and over this function until it returns true. Note that if
+ * there is a refcount bug this might never happen !
+ */
+bool hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	struct device *device = devmem->device;
+
+	hmm_devmem_kill(&devmem->ref);
+
+	if (devmem->pagemap) {
+		devm_memremap_pages_remove(device, devmem->pagemap);
+		devmem->pagemap = NULL;
+	}
+
+	hmm_devmem_exit(&devmem->ref);
+
+	/* FIXME maybe wait a bit ? */
+	if (devmem->inuse)
+		return false;
+
+	if (devmem->resource) {
+		resource_size_t size = resource_size(devmem->resource);
+
+		devm_release_mem_region(device, devmem->resource->start, size);
+		devmem->resource = NULL;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct use to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @src_pfns: array of hmm_pfn_t containing source pfns
+ * @dst_pfns: array of hmm_pfn_t containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around hmm_vma_migrate() which check the migration status
+ * for a given fault address and return corresponding page fault handler status
+ * ie 0 on success or VM_FAULT_SIGBUS if migration failed for fault address.
+ *
+ * This is an helper intendend to be use by ZONE_DEVICE fault handler.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   hmm_pfn_t *src_pfns,
+			   hmm_pfn_t *dst_pfns,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private)
+{
+	if (hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private))
+		return VM_FAULT_SIGBUS;
+
+	if (dst_pfns[(addr - start) >> PAGE_SHIFT] & HMM_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [HMM v15 16/16] mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
@ 2017-01-06 16:46 ` Jérôme Glisse
  2017-01-06 20:54 ` [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Dave Hansen
  16 siblings, 0 replies; 30+ messages in thread
From: Jérôme Glisse @ 2017-01-06 16:46 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduce a dummy HMM device class so device driver can use it to
create hmm_device for the sole purpose of registering device memory.
It is usefull to device driver that want to manage multiple physical
device memory under same device umbrella.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 ++++++++++++-
 mm/hmm.c            | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 674aa79..57e88e4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,10 +76,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -490,6 +490,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned		minor;
+};
+
+/*
+ * Device driver that want to handle multiple devices memory through a single
+ * fake device can use hmm_device to do so. This is purely an helper and it
+ * is not needed to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index a9af1bc..1e29dfd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
@@ -985,4 +986,98 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * Device driver that want to handle multiple devices memory through a single
+ * fake device can use hmm_device to do so. This is purely an helper and it
+ * is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void)
+{
+	struct hmm_device *hmm_device;
+	int ret;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	ret = alloc_chrdev_region(&hmm_device->device.devt,0,1,"hmm_device");
+	if (ret < 0) {
+		kfree(hmm_device);
+		return NULL;
+	}
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor=find_first_zero_bit(hmm_device_mask,HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return NULL;
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+	unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+	class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2017-01-06 16:46   ` David Nellans
@ 2017-01-06 17:13     ` Jerome Glisse
  2017-01-10 15:30       ` David Nellans
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-01-06 17:13 UTC (permalink / raw)
  To: David Nellans
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti, Cameron Buschardt,
	Zi Yan, Anshuman Khandual

On Fri, Jan 06, 2017 at 10:46:09AM -0600, David Nellans wrote:
> 
> 
> On 01/06/2017 10:46 AM, Jérôme Glisse wrote:
> > This patch add a new memory migration helpers, which migrate memory
> > backing a range of virtual address of a process to different memory
> > (which can be allocated through special allocator). It differs from
> > numa migration by working on a range of virtual address and thus by
> > doing migration in chunk that can be large enough to use DMA engine
> > or special copy offloading engine.
> >
> > Expected users are any one with heterogeneous memory where different
> > memory have different characteristics (latency, bandwidth, ...). As
> > an example IBM platform with CAPI bus can make use of this feature
> > to migrate between regular memory and CAPI device memory. New CPU
> > architecture with a pool of high performance memory not manage as
> > cache but presented as regular memory (while being faster and with
> > lower latency than DDR) will also be prime user of this patch.
> Why should the normal page migration path (where neither src nor dest
> are device private), use the hmm_migrate functionality?  11-14 are
> replicating a lot of the normal migration functionality but with special
> casing for HMM requirements.

You are mischaracterizing patch 11-14. Patch 11-12 adds new flags and
modify existing functions so that they can be share. Patch 13 implement
new migration helper while patch 14 optimize this new migration helper.

hmm_migrate() is different from existing migration code because it works
on virtual address range of a process. Existing migration code works
from page. The only difference with existing code is that we collect
pages from virtual address and we allow use of dma engine to perform
copy.

> When migrating THP's or a list of pages (your use case above), normal
> NUMA migration is going to want to do this as fast as possible too (see
> Zi Yan's patches for multi-threading normal migrations & prototype of
> using intel IOAT for transfers, he sees 3-5x speedup).

This is core features of HMM and as such optimization like better THP
support are defer to later patchset.

> 
> If the intention is to provide a common interface hook for migration to
> use DMA acceleration (which is a good idea), it probably shouldn't be
> special cased inside HMM functionality. For example, using the intel IOAT
> for migration DMA has nothing to do with HMM whatsoever. We need a normal
> migration path interface to allow DMA that isn't tied to HMM.

There is nothing that ie hmm_migrate() to HMM. If that make you feel better
i can drop the hmm_ prefix but i would need another name than migrate() as
it is already taken. I can probably name it vma_range_dma_migrate() or
something like that.

The only think that is HMM specific in this code is understanding HMM special
page table entry and handling those. Such entry can only be migrated by DMA
and not by memcpy hence why i do not modify existing code to support those.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2017-01-06 16:46 ` [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
@ 2017-01-06 17:58   ` Dan Williams
  0 siblings, 0 replies; 30+ messages in thread
From: Dan Williams @ 2017-01-06 17:58 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Andrew Morton, linux-kernel, Linux MM, John Hubbard, Ross Zwisler

On Fri, Jan 6, 2017 at 8:46 AM, Jérôme Glisse <jglisse@redhat.com> wrote:
> Some device driver manage multiple physical devices memory from a single
> fake device driver. In that case the fake device might outlive the real
> device and ZONE_DEVICE and its resource allocated for a real device would
> waste resources in the meantime.
>
> This patch allow early removal of ZONE_DEVICE and associated resource,
> before device driver is tear down.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/memremap.h |  7 +++++++
>  kernel/memremap.c        | 14 ++++++++++++++
>  2 files changed, 21 insertions(+)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f7e0609..32314d2 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -53,6 +53,7 @@ struct dev_pagemap {
>  void *devm_memremap_pages(struct device *dev, struct resource *res,
>                 struct percpu_ref *ref, struct vmem_altmap *altmap);
>  struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
> +int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
>
>  static inline bool dev_page_allow_migrate(const struct page *page)
>  {
> @@ -78,6 +79,12 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
>         return NULL;
>  }
>
> +static inline int devm_memremap_pages_remove(struct device *dev,
> +                                            struct dev_pagemap *pgmap)
> +{
> +       return -EINVAL;
> +}
> +
>  static inline bool dev_page_allow_migrate(const struct page *page)
>  {
>         return false;
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 07665eb..250ef25 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -387,6 +387,20 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>  }
>  EXPORT_SYMBOL(devm_memremap_pages);
>
> +static int devm_page_map_match(struct device *dev, void *data, void *match_data)
> +{
> +       struct page_map *page_map = data;
> +
> +       return &page_map->pgmap == match_data;
> +}
> +
> +int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap)
> +{
> +       return devres_release(dev, &devm_memremap_pages_release,
> +                             &devm_page_map_match, pgmap);
> +}
> +EXPORT_SYMBOL(devm_memremap_pages_remove);

I think this should be called devm_memunmap_pages() to mirror
devm_memunmap(), and it should take the virtual address returned from
devm_memremap_pages() not pgmap which is an internal detail.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15
  2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
                   ` (15 preceding siblings ...)
  2017-01-06 16:46 ` [HMM v15 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
@ 2017-01-06 20:54 ` Dave Hansen
  2017-01-06 21:16   ` Jerome Glisse
  16 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-01-06 20:54 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm; +Cc: John Hubbard

On 01/06/2017 08:46 AM, Jérôme Glisse wrote:
> I think it is ready for next or at least i would like to know any
> reasons to not accept this patchset.

Do you have a real in-tree user for this yet?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15
  2017-01-06 20:54 ` [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Dave Hansen
@ 2017-01-06 21:16   ` Jerome Glisse
  2017-01-06 21:36     ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-01-06 21:16 UTC (permalink / raw)
  To: Dave Hansen; +Cc: akpm, linux-kernel, linux-mm, John Hubbard, Ben Skeggs

On Fri, Jan 06, 2017 at 12:54:41PM -0800, Dave Hansen wrote:
> On 01/06/2017 08:46 AM, Jérôme Glisse wrote:
> > I think it is ready for next or at least i would like to know any
> > reasons to not accept this patchset.
> 
> Do you have a real in-tree user for this yet?

Nouvau would be the first one we don't have the kernel space bit yet
to support page fault. We definitly plan to have this working with
nouveau, maybe 4.10 or 4.11.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15
  2017-01-06 21:16   ` Jerome Glisse
@ 2017-01-06 21:36     ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2017-01-06 21:36 UTC (permalink / raw)
  To: Dave Hansen; +Cc: akpm, linux-kernel, linux-mm, John Hubbard, Ben Skeggs

On Fri, Jan 06, 2017 at 04:16:20PM -0500, Jerome Glisse wrote:
> On Fri, Jan 06, 2017 at 12:54:41PM -0800, Dave Hansen wrote:
> > On 01/06/2017 08:46 AM, Jérôme Glisse wrote:
> > > I think it is ready for next or at least i would like to know any
> > > reasons to not accept this patchset.
> > 
> > Do you have a real in-tree user for this yet?
> 
> Nouvau would be the first one we don't have the kernel space bit yet
> to support page fault. We definitly plan to have this working with
> nouveau, maybe 4.10 or 4.11.
> 

Sorry meant 4.11, 4.12 ...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-06 16:46 ` [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
@ 2017-01-09  9:19   ` Balbir Singh
  2017-01-09 16:21     ` Dave Hansen
  0 siblings, 1 reply; 30+ messages in thread
From: Balbir Singh @ 2017-01-09  9:19 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On Fri, Jan 06, 2017 at 11:46:28AM -0500, Jérôme Glisse wrote:
> Catch page from ZONE_DEVICE in free_hot_cold_page(). This should never
> happen as ZONE_DEVICE page must always have an elevated refcount.
> 
> This is safety-net to catch any refcounting issues in a sane way for any
> ZONE_DEVICE pages.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1c24112..355beb4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2445,6 +2445,16 @@ void free_hot_cold_page(struct page *page, bool cold)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype;
>  
> +	/*
> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
> +	 * active refcount. Complain about it and try to restore the refcount.
> +	 */
> +	if (is_zone_device_page(page)) {
> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);

This can be VM_BUG_ON_PAGE(1, page), hopefully the compiler does the right thing
here. I suspect this should be a BUG_ON, independent of CONFIG_DEBUG_VM

> +		page_ref_inc(page);
> +		return;
> +	}
> +

Balbir Singh.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-09  9:19   ` Balbir Singh
@ 2017-01-09 16:21     ` Dave Hansen
  2017-01-09 16:57       ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-01-09 16:21 UTC (permalink / raw)
  To: Balbir Singh, Jérôme Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 01/09/2017 01:19 AM, Balbir Singh wrote:
>> +	/*
>> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
>> +	 * active refcount. Complain about it and try to restore the refcount.
>> +	 */
>> +	if (is_zone_device_page(page)) {
>> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
> This can be VM_BUG_ON_PAGE(1, page), hopefully the compiler does the right thing
> here. I suspect this should be a BUG_ON, independent of CONFIG_DEBUG_VM

BUG_ON() means "kill the machine dead".  Do we really want a guaranteed
dead machine if someone screws up their refcounting?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-09 16:21     ` Dave Hansen
@ 2017-01-09 16:57       ` Jerome Glisse
  2017-01-09 17:00         ` Dave Hansen
  0 siblings, 1 reply; 30+ messages in thread
From: Jerome Glisse @ 2017-01-09 16:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Balbir Singh, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, Ross Zwisler

On Mon, Jan 09, 2017 at 08:21:25AM -0800, Dave Hansen wrote:
> On 01/09/2017 01:19 AM, Balbir Singh wrote:
> >> +	/*
> >> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
> >> +	 * active refcount. Complain about it and try to restore the refcount.
> >> +	 */
> >> +	if (is_zone_device_page(page)) {
> >> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
> > This can be VM_BUG_ON_PAGE(1, page), hopefully the compiler does the right thing
> > here. I suspect this should be a BUG_ON, independent of CONFIG_DEBUG_VM
> 
> BUG_ON() means "kill the machine dead".  Do we really want a guaranteed
> dead machine if someone screws up their refcounting?

VM_BUG_ON_PAGE ok with you ? It is just a safety net, i can simply drop that
patch if people have too much feeling about it.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-09 16:57       ` Jerome Glisse
@ 2017-01-09 17:00         ` Dave Hansen
  2017-01-09 17:58           ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Hansen @ 2017-01-09 17:00 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Balbir Singh, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, Ross Zwisler

On 01/09/2017 08:57 AM, Jerome Glisse wrote:
> On Mon, Jan 09, 2017 at 08:21:25AM -0800, Dave Hansen wrote:
>> On 01/09/2017 01:19 AM, Balbir Singh wrote:
>>>> +	/*
>>>> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
>>>> +	 * active refcount. Complain about it and try to restore the refcount.
>>>> +	 */
>>>> +	if (is_zone_device_page(page)) {
>>>> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
>>> This can be VM_BUG_ON_PAGE(1, page), hopefully the compiler does the right thing
>>> here. I suspect this should be a BUG_ON, independent of CONFIG_DEBUG_VM
>> BUG_ON() means "kill the machine dead".  Do we really want a guaranteed
>> dead machine if someone screws up their refcounting?
> VM_BUG_ON_PAGE ok with you ? It is just a safety net, i can simply drop that
> patch if people have too much feeling about it.

Enough distros turn on DEBUG_VM that there's basically no difference
between VM_BUG_ON() and BUG_ON().

I also think it would be much nicer if you buried the check in the
allocator in a slow path somewhere instead of sticking it in one of the
hottest paths in the whole kernel.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2017-01-09 17:00         ` Dave Hansen
@ 2017-01-09 17:58           ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2017-01-09 17:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Balbir Singh, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, Ross Zwisler

On Mon, Jan 09, 2017 at 09:00:34AM -0800, Dave Hansen wrote:
> On 01/09/2017 08:57 AM, Jerome Glisse wrote:
> > On Mon, Jan 09, 2017 at 08:21:25AM -0800, Dave Hansen wrote:
> >> On 01/09/2017 01:19 AM, Balbir Singh wrote:
> >>>> +	/*
> >>>> +	 * This should never happen ! Page from ZONE_DEVICE always must have an
> >>>> +	 * active refcount. Complain about it and try to restore the refcount.
> >>>> +	 */
> >>>> +	if (is_zone_device_page(page)) {
> >>>> +		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
> >>> This can be VM_BUG_ON_PAGE(1, page), hopefully the compiler does the right thing
> >>> here. I suspect this should be a BUG_ON, independent of CONFIG_DEBUG_VM
> >> BUG_ON() means "kill the machine dead".  Do we really want a guaranteed
> >> dead machine if someone screws up their refcounting?
> > VM_BUG_ON_PAGE ok with you ? It is just a safety net, i can simply drop that
> > patch if people have too much feeling about it.
> 
> Enough distros turn on DEBUG_VM that there's basically no difference
> between VM_BUG_ON() and BUG_ON().
> 
> I also think it would be much nicer if you buried the check in the
> allocator in a slow path somewhere instead of sticking it in one of the
> hottest paths in the whole kernel.

Well i will just drop that patch then. The point was to catch error
early on before anything happen. This is just a safety net so not
fundamental.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2017-01-06 17:13     ` Jerome Glisse
@ 2017-01-10 15:30       ` David Nellans
  2017-01-10 16:58         ` Jerome Glisse
  0 siblings, 1 reply; 30+ messages in thread
From: David Nellans @ 2017-01-10 15:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti, Cameron Buschardt,
	Zi Yan, Anshuman Khandual


> You are mischaracterizing patch 11-14. Patch 11-12 adds new flags and
> modify existing functions so that they can be share. Patch 13 implement
> new migration helper while patch 14 optimize this new migration helper.
>
> hmm_migrate() is different from existing migration code because it works
> on virtual address range of a process. Existing migration code works
> from page. The only difference with existing code is that we collect
> pages from virtual address and we allow use of dma engine to perform
> copy.
You're right, but why not just introduce a new general migration interface
that works on vma range first, then case all the normal migration paths for
HMM and then DMA?  Being able to migrate based on vma range certainly
makes user level control of memory placement/migration less complicated
than
page interfaces.

> There is nothing that ie hmm_migrate() to HMM. If that make you feel better
> i can drop the hmm_ prefix but i would need another name than migrate() as
> it is already taken. I can probably name it vma_range_dma_migrate() or
> something like that.
>
> The only think that is HMM specific in this code is understanding HMM special
> page table entry and handling those. Such entry can only be migrated by DMA
> and not by memcpy hence why i do not modify existing code to support those.
I'd be happier if there was a vma_migrate proposed independently, I
think it would find
users outside the HMM sandbox. In the IBM migration case, they might
want the vma
interface but choose to use CPU based migration rather than this DMA
interface,
It certainly would make testing of the vma_migrate interface easier.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2017-01-10 15:30       ` David Nellans
@ 2017-01-10 16:58         ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2017-01-10 16:58 UTC (permalink / raw)
  To: David Nellans
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti, Cameron Buschardt,
	Zi Yan, Anshuman Khandual

On Tue, Jan 10, 2017 at 09:30:30AM -0600, David Nellans wrote:
> 
> > You are mischaracterizing patch 11-14. Patch 11-12 adds new flags and
> > modify existing functions so that they can be share. Patch 13 implement
> > new migration helper while patch 14 optimize this new migration helper.
> >
> > hmm_migrate() is different from existing migration code because it works
> > on virtual address range of a process. Existing migration code works
> > from page. The only difference with existing code is that we collect
> > pages from virtual address and we allow use of dma engine to perform
> > copy.
> You're right, but why not just introduce a new general migration interface
> that works on vma range first, then case all the normal migration paths for
> HMM and then DMA?  Being able to migrate based on vma range certainly
> makes user level control of memory placement/migration less complicated
> than page interfaces.

Special casing for HMM and DMA is already what those patches do. They share
as much code as doable with existing path. There is one thing to consider
here, because we are working on vma range we can easily optimize the unmap
step. This is why i do not share any of the outer loop with existing code.

Sharing more code than this will be counter-productive from optimization
point of view.

> 
> > There is nothing that ie hmm_migrate() to HMM. If that make you feel better
> > i can drop the hmm_ prefix but i would need another name than migrate() as
> > it is already taken. I can probably name it vma_range_dma_migrate() or
> > something like that.
> >
> > The only think that is HMM specific in this code is understanding HMM special
> > page table entry and handling those. Such entry can only be migrated by DMA
> > and not by memcpy hence why i do not modify existing code to support those.
> I'd be happier if there was a vma_migrate proposed independently, I think
> it would find users outside the HMM sandbox. In the IBM migration case,
> they might want the vma interface but choose to use CPU based migration
> rather than this DMA interface, It certainly would make testing of the
> vma_migrate interface easier.

Like i said that code is not in HMM sandbox, it seats behind its own kernel
option and do not rely on any HMM thing beside hmm_pfn_t which is pfn with
a bunch of flags. The only difference with existing code is that it does
understand HMM CPU pte. It can easily be rename without hmm_ prefix if that
is what people want. The hmm_pfn_t is harder to replace as there isn't any-
thing that match the requirement (need few flags: DEVICE,MIGRATE,EMPTY,
UNADDRESSABLE).

The DMA is a callback function the caller of hmm_migrate() provide so you can
easily provide a callback that just do memcpy (well copy_highpage()). There
is no need to make any change. I can even provide a default CPU copy call-
back.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-01-10 16:58 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-06 16:46 [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
2017-01-09  9:19   ` Balbir Singh
2017-01-09 16:21     ` Dave Hansen
2017-01-09 16:57       ` Jerome Glisse
2017-01-09 17:00         ` Dave Hansen
2017-01-09 17:58           ` Jerome Glisse
2017-01-06 16:46 ` [HMM v15 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
2017-01-06 17:58   ` Dan Williams
2017-01-06 16:46 ` [HMM v15 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2 Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 06/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
2017-01-06 16:46   ` David Nellans
2017-01-06 17:13     ` Jerome Glisse
2017-01-10 15:30       ` David Nellans
2017-01-10 16:58         ` Jerome Glisse
2017-01-06 16:46 ` [HMM v15 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
2017-01-06 16:46 ` [HMM v15 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
2017-01-06 20:54 ` [HMM v15 00/16] HMM (Heterogeneous Memory Management) v15 Dave Hansen
2017-01-06 21:16   ` Jerome Glisse
2017-01-06 21:36     ` Jerome Glisse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).