* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 16:39 ` [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
@ 2016-12-08 16:21   ` Dave Hansen
  2016-12-08 16:39     ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2016-12-08 16:21 UTC (permalink / raw)
  To: Jérôme Glisse, akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Dan Williams, Ross Zwisler

On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
> Architectures that wish to support un-addressable device memory should make
> sure to never populate the kernel linear mapping for the physical range.

Does the platform somehow provide a range of physical addresses for this
unaddressable area?  How do we know no memory will be hot-added in a
range we're using for unaddressable device memory, for instance?


* [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14
@ 2016-12-08 16:39 Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
                   ` (15 more replies)
  0 siblings, 16 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Cliff notes: HMM offers two things (each standing on its own). First,
it allows device memory to be used transparently inside any process
without any modification to the program code. Second, it allows a
process address space to be mirrored on a device.

Changes since v13 are small; they mostly incorporate everyone's remarks
into the patchset (splitting each feature into its own kernel config
option, splitting optimizations from the base implementation, improving
comments, ...).


The patchset is divided into 3 features that can each be used independently
of one another. First are changes to ZONE_DEVICE so we can have struct
pages for un-addressable device memory (patches 2-6). Second is process
address space mirroring (patches 8 to 10), which allows snapshotting the
CPU page table and keeping the device page table synchronized with the CPU
one.

Last is a new page migration helper which allows migrating a range of
virtual addresses using a hardware copy engine (patches 11-14).

Other patches just introduce common definitions or add safety nets to
catch wrong uses of some of the features.


Andrew, do you want anyone specific to review any particular part of the
patchset before considering it for inclusion? At this point I want to
know if there is any chance of getting this upstream, or do we decide
that we don't want to support this kind of hardware?


In this patchset I restricted myself to a set of core features. What
is missing:
  - forcing read-only on the CPU for memory duplication and GPU atomics
  - changes to mmu_notifier for optimization purposes
  - migration of file-backed pages to device memory

I plan to submit a couple more patchsets to implement those features
once core HMM is upstream.


Previous patchset posting :
    v1 http://lwn.net/Articles/597289/
    v2 https://lkml.org/lkml/2014/6/12/559
    v3 https://lkml.org/lkml/2014/6/13/633
    v4 https://lkml.org/lkml/2014/8/29/423
    v5 https://lkml.org/lkml/2014/11/3/759
    v6 http://lwn.net/Articles/619737/
    v7 http://lwn.net/Articles/627316/
    v8 https://lwn.net/Articles/645515/
    v9 https://lwn.net/Articles/651553/
    v10 https://lwn.net/Articles/654430/
    v11 http://www.gossamer-threads.com/lists/linux/kernel/2286424
    v12 http://www.kernelhub.org/?msg=972982&p=2
    v13 https://lwn.net/Articles/706856/

Cheers,
Jérôme

Jérôme Glisse (16):
  mm/free_hot_cold_page: catch ZONE_DEVICE pages
  mm/memory/hotplug: convert device bool to int to allow for more flags
    v2
  mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device
    memory
  mm/ZONE_DEVICE/free-page: callback when page is freed
  mm/ZONE_DEVICE/unaddressable: add support for un-addressable device
    memory
  mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  mm/hmm: heterogeneous memory management (HMM for short)
  mm/hmm/mirror: mirror process address space on device with HMM helpers
  mm/hmm/mirror: helper to snapshot CPU page table
  mm/hmm/mirror: device page fault handler
  mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  mm/hmm/migrate: new memory migration helper for use with device memory
    v2
  mm/hmm/migrate: optimize page map once in vma being migrated
  mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory

 MAINTAINERS                                |    7 +
 arch/ia64/mm/init.c                        |   23 +-
 arch/powerpc/mm/mem.c                      |   22 +-
 arch/s390/mm/init.c                        |   10 +-
 arch/sh/mm/init.c                          |   22 +-
 arch/tile/mm/init.c                        |   10 +-
 arch/x86/mm/init_32.c                      |   23 +-
 arch/x86/mm/init_64.c                      |   41 +-
 drivers/dax/pmem.c                         |    3 +-
 drivers/nvdimm/pmem.c                      |    7 +-
 drivers/staging/lustre/lustre/llite/rw26.c |    8 +-
 fs/aio.c                                   |    7 +-
 fs/btrfs/disk-io.c                         |   11 +-
 fs/hugetlbfs/inode.c                       |    9 +-
 fs/nfs/internal.h                          |    5 +-
 fs/nfs/write.c                             |    9 +-
 fs/proc/task_mmu.c                         |   10 +-
 fs/ubifs/file.c                            |    8 +-
 include/linux/balloon_compaction.h         |    3 +-
 include/linux/fs.h                         |   13 +-
 include/linux/hmm.h                        |  525 ++++++++++++++
 include/linux/memory_hotplug.h             |   31 +-
 include/linux/memremap.h                   |   60 +-
 include/linux/migrate.h                    |    7 +-
 include/linux/mm_types.h                   |    5 +
 include/linux/swap.h                       |   18 +-
 include/linux/swapops.h                    |   67 ++
 kernel/fork.c                              |    2 +
 kernel/memremap.c                          |   69 +-
 mm/Kconfig                                 |   51 ++
 mm/Makefile                                |    1 +
 mm/balloon_compaction.c                    |    2 +-
 mm/hmm.c                                   | 1082 ++++++++++++++++++++++++++++
 mm/memory.c                                |   62 ++
 mm/memory_hotplug.c                        |    4 +-
 mm/migrate.c                               |  687 +++++++++++++++++-
 mm/mprotect.c                              |   12 +
 mm/page_alloc.c                            |   10 +
 mm/rmap.c                                  |   47 ++
 mm/zsmalloc.c                              |   12 +-
 tools/testing/nvdimm/test/iomap.c          |    3 +-
 41 files changed, 2924 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

-- 
2.4.3


* [HMM v14 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

Catch pages from ZONE_DEVICE in free_hot_cold_page(). This should never
happen, as a ZONE_DEVICE page must always have an elevated refcount.

This is a safety net to catch any refcounting issue in a sane way for any
ZONE_DEVICE page.

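For illustration only (not part of the patch), a rough sketch of the expected
life cycle, assuming some device-owned pfn "device_pfn":

	struct page *page = pfn_to_page(device_pfn);

	/* a free ZONE_DEVICE page idles at refcount 1, so taking and
	 * dropping a reference never brings it down to 0 */
	get_page(page);		/* 1 -> 2: page handed out / mapped somewhere */
	/* ... page is used ... */
	put_page(page);		/* 2 -> 1: page is "free" again */

free_hot_cold_page() is only reached when a refcount drops to 0, which for a
ZONE_DEVICE page indicates a refcounting bug, hence the safety net.
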
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fbfead..09b2630 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2435,6 +2435,16 @@ void free_hot_cold_page(struct page *page, bool cold)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
+	/*
+	 * This should never happen! Pages from ZONE_DEVICE must always have an
+	 * active refcount. Complain about it and try to restore the refcount.
+	 */
+	if (is_zone_device_page(page)) {
+		VM_BUG_ON_PAGE(is_zone_device_page(page), page);
+		page_ref_inc(page);
+		return;
+	}
+
 	if (!free_pcp_prepare(page))
 		return;
 
-- 
2.4.3


* [HMM v14 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Russell King,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	Chris Metcalf, Thomas Gleixner, Ingo Molnar, H. Peter Anvin

When hotplugging memory we want more information on the type of memory and
its properties. Replace the device boolean flag with an int and define a set
of flags.

A new property for device memory is an opt-in flag to allow page migration
from and to a ZONE_DEVICE zone. Existing users of ZONE_DEVICE do not expect
page migration to work for their pages. New changes to page migration are
changing that and we now need a flag to explicitly opt in to page migration.

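As a rough illustration, this is how the flag now flows into arch_add_memory()
for device memory (see the kernel/memremap.c hunk below), and how code could
later test the opt-in (the if block is illustrative only):

	/* device memory is added with an explicit flag instead of the old
	 * 'true' boolean */
	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);

	/* any code that considers migrating a page can then check: */
	if (dev_page_allow_migrate(page)) {
		/* page is in a ZONE_DEVICE zone that opted in to migration */
	}
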
Changes since v1:
  - Improved commit message
  - Improved define name
  - Improved comments
  - Typos

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/ia64/mm/init.c            | 23 ++++++++++++++++++++---
 arch/powerpc/mm/mem.c          | 22 +++++++++++++++++++---
 arch/s390/mm/init.c            | 10 ++++++++--
 arch/sh/mm/init.c              | 22 +++++++++++++++++++---
 arch/tile/mm/init.c            | 10 ++++++++--
 arch/x86/mm/init_32.c          | 23 ++++++++++++++++++++---
 arch/x86/mm/init_64.c          | 23 ++++++++++++++++++++---
 include/linux/memory_hotplug.h | 24 ++++++++++++++++++++++--
 include/linux/memremap.h       | 11 +++++++++++
 kernel/memremap.c              |  4 ++--
 mm/memory_hotplug.c            |  4 ++--
 11 files changed, 151 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 1841ef6..303027e 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -645,18 +645,27 @@ mem_init (void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 
 	if (ret)
@@ -667,13 +676,21 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 5f84433..6e877d3 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -126,14 +126,22 @@ int __weak remove_section_mapping(unsigned long start, unsigned long end)
 	return -ENODEV;
 }
 
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata;
 	struct zone *zone;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int rc;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdata = NODE_DATA(nid);
 
 	start = (unsigned long)__va(start);
@@ -147,19 +155,27 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 
 	/* this should work for most non-highmem platforms */
 	zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, 0, for_device);
+		zone_for_memory(nid, start, size, 0, flags & MEMORY_DEVICE);
 
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (ret)
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index f56a39b..00bae81 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -149,7 +149,7 @@ void __init free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	unsigned long normal_end_pfn = PFN_DOWN(memblock_end_of_DRAM());
 	unsigned long dma_end_pfn = PFN_DOWN(MAX_DMA_ADDRESS);
@@ -158,6 +158,12 @@ int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
 	unsigned long nr_pages;
 	int rc, zone_enum;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	rc = vmem_add_mapping(start, size);
 	if (rc)
 		return rc;
@@ -197,7 +203,7 @@ unsigned long memory_block_size_bytes(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/*
 	 * There is no hardware or firmware interface which could trigger a
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 7549186..0ca69ac 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -485,19 +485,27 @@ void free_initrd_mem(unsigned long start, unsigned long end)
 #endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	pg_data_t *pgdat;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	pgdat = NODE_DATA(nid);
 
 	/* We only have ZONE_NORMAL, so this is easy.. */
 	ret = __add_pages(nid, pgdat->node_zones +
 			zone_for_memory(nid, start, size, ZONE_NORMAL,
-			for_device),
+					flags & MEMORY_DEVICE),
 			start_pfn, nr_pages);
 	if (unlikely(ret))
 		printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
@@ -516,13 +524,21 @@ EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
 #endif
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	if (unlikely(ret))
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index adce254..ba001b1 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -863,13 +863,19 @@ void __init mem_init(void)
  * memory to the highmem for now.
  */
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-int arch_add_memory(u64 start, u64 size, bool for_device)
+int arch_add_memory(u64 start, u64 size, int flags)
 {
 	struct pglist_data *pgdata = &contig_page_data;
 	struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(zone, start_pfn, nr_pages);
 }
 
@@ -879,7 +885,7 @@ int remove_memory(u64 start, u64 size)
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
 	/* TODO */
 	return -EBUSY;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cf80590..8287a4b 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -816,24 +816,41 @@ void __init mem_init(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdata = NODE_DATA(nid);
 	struct zone *zone = pgdata->node_zones +
-		zone_for_memory(nid, start, size, ZONE_HIGHMEM, for_device);
+		zone_for_memory(nid, start, size, ZONE_HIGHMEM,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	return __add_pages(nid, zone, start_pfn, nr_pages);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-int arch_remove_memory(u64 start, u64 size)
+int arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct zone *zone;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	zone = page_zone(pfn_to_page(start_pfn));
 	return __remove_pages(zone, start_pfn, nr_pages);
 }
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 14b9dd7..442ac86 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -651,15 +651,24 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
  * Memory is added always to NORMAL zone. This means you will never get
  * additional DMA/DMA32 memory.
  */
-int arch_add_memory(int nid, u64 start, u64 size, bool for_device)
+int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL, for_device);
+		zone_for_memory(nid, start, size, ZONE_NORMAL,
+				flags & MEMORY_DEVICE);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
@@ -956,8 +965,10 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	remove_pagetable(start, end, true);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
+	const int supported_flags = MEMORY_DEVICE |
+				    MEMORY_DEVICE_ALLOW_MIGRATE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -965,6 +976,12 @@ int __ref arch_remove_memory(u64 start, u64 size)
 	struct zone *zone;
 	int ret;
 
+	/* Each flag needs special handling so error out on unsupported flags */
+	if (flags & (~supported_flags)) {
+		BUG();
+		return -EINVAL;
+	}
+
 	/* With altmap the first mapped page is offset from @start */
 	altmap = to_vmem_altmap((unsigned long) page);
 	if (altmap)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 01033fa..3f50eb8 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -103,7 +103,7 @@ extern bool memhp_auto_online;
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
-extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_memory(u64 start, u64 size, int flags);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -275,7 +275,27 @@ extern int add_memory(int nid, u64 start, u64 size);
 extern int add_memory_resource(int nid, struct resource *resource, bool online);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
 		bool for_device);
-extern int arch_add_memory(int nid, u64 start, u64 size, bool for_device);
+
+/*
+ * When hot-plugging memory with arch_add_memory() we want more information on
+ * the type of memory and its properties. The flags parameter allows providing
+ * more information on the memory which is being added.
+ *
+ * Provide an opt-in flag for struct page migration. Persistent device memory
+ * never relied on struct page migration so far and new users might also
+ * prefer avoiding struct page migration.
+ *
+ * New non-device-memory-specific flags can be added if ever needed.
+ *
+ * MEMORY_NORMAL: regular system memory
+ * MEMORY_DEVICE: device memory, create a ZONE_DEVICE zone for it
+ * MEMORY_DEVICE_ALLOW_MIGRATE: pages in that device memory can be migrated
+ */
+#define MEMORY_NORMAL 0
+#define MEMORY_DEVICE (1 << 0)
+#define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+
+extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 9341619..f7e0609 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,6 +53,12 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return ((page_zonenum(page) == ZONE_DEVICE) &&
+		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 		struct resource *res, struct percpu_ref *ref,
@@ -71,6 +77,11 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 {
 	return NULL;
 }
+
+static inline bool dev_page_allow_migrate(const struct page *page)
+{
+	return false;
+}
 #endif
 
 /**
diff --git a/kernel/memremap.c b/kernel/memremap.c
index b501e39..07665eb 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -246,7 +246,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size);
+	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -358,7 +358,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, true);
+	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
 	if (error)
 		goto err_add_memory;
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9629273..9e588b2 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1386,7 +1386,7 @@ int __ref add_memory_resource(int nid, struct resource *res, bool online)
 	}
 
 	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid, start, size, false);
+	ret = arch_add_memory(nid, start, size, MEMORY_NORMAL);
 
 	if (ret < 0)
 		goto error;
@@ -2205,7 +2205,7 @@ void __ref remove_memory(int nid, u64 start, u64 size)
 	memblock_free(start, size);
 	memblock_remove(start, size);
 
-	arch_remove_memory(start, size);
+	arch_remove_memory(start, size, MEMORY_NORMAL);
 
 	try_offline_node(nid);
 
-- 
2.4.3


* [HMM v14 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

Some device drivers manage the memory of multiple physical devices from a
single fake device driver. In that case the fake device might outlive a real
device, and the ZONE_DEVICE zone and resource allocated for that real device
would waste resources in the meantime.

This patch allows early removal of a ZONE_DEVICE zone and its associated
resource, before the device driver is torn down.

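A hypothetical usage sketch (declarations of res and ref, error handling and
locking around the pgmap lookup all elided): the fake parent device maps one
real device's memory, then tears it down early when that real device goes
away:

	void *addr;
	struct dev_pagemap *pgmap;

	/* at real-device probe time, from the fake parent device "dev" */
	addr = devm_memremap_pages(dev, &res, &ref, NULL);
	pgmap = find_dev_pagemap(res.start);

	/* ... the real device is unplugged while "dev" lives on ... */
	devm_memremap_pages_remove(dev, pgmap);
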
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/memremap.h |  7 +++++++
 kernel/memremap.c        | 14 ++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f7e0609..32314d2 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -53,6 +53,7 @@ struct dev_pagemap {
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 		struct percpu_ref *ref, struct vmem_altmap *altmap);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
 static inline bool dev_page_allow_migrate(const struct page *page)
 {
@@ -78,6 +79,12 @@ static inline struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
 	return NULL;
 }
 
+static inline int devm_memremap_pages_remove(struct device *dev,
+					     struct dev_pagemap *pgmap)
+{
+	return -EINVAL;
+}
+
 static inline bool dev_page_allow_migrate(const struct page *page)
 {
 	return false;
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 07665eb..250ef25 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -387,6 +387,20 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 }
 EXPORT_SYMBOL(devm_memremap_pages);
 
+static int devm_page_map_match(struct device *dev, void *data, void *match_data)
+{
+	struct page_map *page_map = data;
+
+	return &page_map->pgmap == match_data;
+}
+
+int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap)
+{
+	return devres_release(dev, &devm_memremap_pages_release,
+			      &devm_page_map_match, pgmap);
+}
+EXPORT_SYMBOL(devm_memremap_pages_remove);
+
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 {
 	/* number of pfns from base where pfn_to_page() is valid */
-- 
2.4.3


* [HMM v14 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (2 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

When a ZONE_DEVICE page refcount reaches 1 it means it is free and nobody
is holding a reference on it (only the device to which the memory belongs
does). Add a callback and call it when that happens so device drivers can
implement their own free page management.

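A hypothetical sketch of a driver hooking its own allocator into the new
callback (my_devmem, my_devmem_free_list_add and devmem are made-up names,
other declarations elided):

	static void my_devmem_page_free(struct page *page, void *data)
	{
		struct my_devmem *devmem = data;

		/* the refcount just went back to 1: nobody but the device
		 * holds the page anymore, return it to the driver allocator */
		my_devmem_free_list_add(devmem, page);
	}

	/* registration, using the extended devm_memremap_pages() signature */
	addr = devm_memremap_pages(dev, &res, &ref, altmap,
				   my_devmem_page_free, devmem);
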
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/dax/pmem.c                |  3 ++-
 drivers/nvdimm/pmem.c             |  5 +++--
 include/linux/memremap.h          | 17 ++++++++++++++---
 kernel/memremap.c                 | 14 +++++++++++++-
 tools/testing/nvdimm/test/iomap.c |  2 +-
 5 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 1f01e98..52ff674 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -107,7 +107,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
+				   altmap, NULL, NULL);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 571a6c7..c261d12 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,7 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-				altmap);
+					   altmap, NULL, NULL);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -269,7 +269,8 @@ static int pmem_attach_disk(struct device *dev,
 		res->start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
-				&q->q_usage_counter, NULL);
+					   &q->q_usage_counter,
+					   NULL, NULL, NULL);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 32314d2..7845f2e 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,23 +35,31 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef void (*dev_page_free_t)(struct page *page, void *data);
+
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_free: free page callback when the page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
+ * @data: private data pointer for page_free
  */
 struct dev_pagemap {
+	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
+	void *data;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap);
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  dev_page_free_t page_free,
+			  void *data);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
@@ -62,8 +70,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 }
 #else
 static inline void *devm_memremap_pages(struct device *dev,
-		struct resource *res, struct percpu_ref *ref,
-		struct vmem_altmap *altmap)
+					struct resource *res,
+					struct percpu_ref *ref,
+					struct vmem_altmap *altmap,
+					dev_page_free_t page_free,
+					void *data)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 250ef25..bc1e400 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -190,6 +190,12 @@ EXPORT_SYMBOL(get_zone_device_page);
 
 void put_zone_device_page(struct page *page)
 {
+	/*
+	 * If the refcount is 1 then the page is free and the refcount is
+	 * stable as nobody holds a reference on the page.
+	 */
+	if (page->pgmap->page_free && page_count(page) == 1)
+		page->pgmap->page_free(page, page->pgmap->data);
 	put_dev_pagemap(page->pgmap);
 }
 EXPORT_SYMBOL(put_zone_device_page);
@@ -270,6 +276,8 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @page_free: callback called when the page refcount reaches 1 (it is free)
+ * @data: private data pointer for page_free
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -280,7 +288,9 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  *    this is not enforced.
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
-		struct percpu_ref *ref, struct vmem_altmap *altmap)
+			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  dev_page_free_t page_free,
+			  void *data)
 {
 	resource_size_t key, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -322,6 +332,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->page_free = page_free;
+	pgmap->data = data;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index c29f8dc..6505a87 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -108,7 +108,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
 
 	if (nfit_res)
 		return nfit_res->buf + offset - nfit_res->res->start;
-	return devm_memremap_pages(dev, res, ref, altmap);
+	return devm_memremap_pages(dev, res, ref, altmap, NULL, NULL);
 }
 EXPORT_SYMBOL(__wrap_devm_memremap_pages);
 
-- 
2.4.3


* [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (3 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:21   ` Dave Hansen
  2016-12-08 16:39 ` [HMM v14 06/16] mm/ZONE_DEVICE/x86: " Jérôme Glisse
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Dan Williams, Ross Zwisler

This adds support for un-addressable device memory. Such memory is hotplugged
only so we can have struct pages for it; we should never map it, as such memory
can not be accessed by the CPU. For that reason it uses a special swap entry
for the CPU page table entry.

This patch implements all the logic, from the special swap type to handling
CPU page faults through a callback specified in the ZONE_DEVICE pgmap struct.

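As a rough sketch of that flow (the helpers are the ones added below, the
surrounding page table locking and pte plumbing are elided):

	/* when a page is migrated to the device, its CPU pte becomes a
	 * special swap entry remembering the un-addressable page */
	swp_entry_t entry = make_device_entry(page, vma->vm_flags & VM_WRITE);
	set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));

	/* on a later CPU access, do_swap_page() sees a non-present pte and
	 * hands the fault to the device via its pgmap->page_fault callback */
	if (is_device_entry(entry))
		ret = device_entry_fault(vma, addr, entry, flags, pmdp);
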
Architectures that wish to support un-addressable device memory should make
sure to never populate the kernel linear mapping for the physical range.

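Putting it together, a hypothetical driver-side registration could look like
this (my_devmem_fault, my_devmem_page_free, my_devmem_migrate_back and devmem
are made-up names; note that altmap must be NULL for un-addressable memory):

	static int my_devmem_fault(struct vm_area_struct *vma,
				   unsigned long addr, struct page *page,
				   unsigned flags, pmd_t *pmdp)
	{
		/* migrate the page back to some CPU accessible page and
		 * return a VM_FAULT_* code */
		return my_devmem_migrate_back(vma, addr, page, pmdp);
	}

	addr = devm_memremap_pages(dev, &res, &ref, NULL /* altmap */, &pgmap,
				   my_devmem_fault, my_devmem_page_free,
				   devmem,
				   MEMORY_DEVICE |
				   MEMORY_DEVICE_ALLOW_MIGRATE |
				   MEMORY_DEVICE_UNADDRESSABLE);
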
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/dax/pmem.c                |  4 +--
 drivers/nvdimm/pmem.c             |  6 ++--
 fs/proc/task_mmu.c                | 10 +++++-
 include/linux/memory_hotplug.h    |  7 ++++
 include/linux/memremap.h          | 29 +++++++++++++++--
 include/linux/swap.h              | 18 +++++++++--
 include/linux/swapops.h           | 67 +++++++++++++++++++++++++++++++++++++++
 kernel/memremap.c                 | 43 +++++++++++++++++++++++--
 mm/Kconfig                        | 12 +++++++
 mm/memory.c                       | 62 ++++++++++++++++++++++++++++++++++++
 mm/mprotect.c                     | 12 +++++++
 tools/testing/nvdimm/test/iomap.c |  3 +-
 12 files changed, 259 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index 52ff674..f65a68a 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -107,8 +107,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
-	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
-				   altmap, NULL, NULL);
+	addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
+				   NULL, NULL, NULL, NULL, MEMORY_DEVICE);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index c261d12..dcad86f 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -260,7 +260,8 @@ static int pmem_attach_disk(struct device *dev,
 	pmem->pfn_flags = PFN_DEV;
 	if (is_nd_pfn(dev)) {
 		addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
-					   altmap, NULL, NULL);
+					   altmap, NULL, NULL, NULL,
+					   NULL, MEMORY_DEVICE);
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
@@ -270,7 +271,8 @@ static int pmem_attach_disk(struct device *dev,
 	} else if (pmem_should_map_pages(dev)) {
 		addr = devm_memremap_pages(dev, &nsio->res,
 					   &q->q_usage_counter,
-					   NULL, NULL, NULL);
+					   NULL, NULL, NULL, NULL,
+					   NULL, MEMORY_DEVICE);
 		pmem->pfn_flags |= PFN_MAP;
 	} else
 		addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6909582..0726d39 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -544,8 +544,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 			} else {
 				mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
 			}
-		} else if (is_migration_entry(swpent))
+		} else if (is_migration_entry(swpent)) {
 			page = migration_entry_to_page(swpent);
+		} else if (is_device_entry(swpent)) {
+			page = device_entry_to_page(swpent);
+		}
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
 		page = find_get_entry(vma->vm_file->f_mapping,
@@ -708,6 +711,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 
 		if (is_migration_entry(swpent))
 			page = migration_entry_to_page(swpent);
+		if (is_device_entry(swpent))
+			page = device_entry_to_page(swpent);
 	}
 	if (page) {
 		int mapcount = page_mapcount(page);
@@ -1191,6 +1196,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_migration_entry(entry))
 			page = migration_entry_to_page(entry);
+
+		if (is_device_entry(entry))
+			page = device_entry_to_page(entry);
 	}
 
 	if (page && !PageAnon(page))
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3f50eb8..e7c5dc6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -285,15 +285,22 @@ extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
  * never relied on struct page migration so far and new users might also
  * prefer avoiding struct page migration.
  *
+ * For device memory (which uses ZONE_DEVICE) we want to differentiate between
+ * CPU-accessible memory (persistent memory, or device memory on a system bus
+ * that allows transparent access to it) and un-addressable memory (device
+ * memory that can not be accessed by the CPU directly).
+ *
  * New non-device-memory-specific flags can be added if ever needed.
  *
  * MEMORY_NORMAL: regular system memory
  * MEMORY_DEVICE: device memory, create a ZONE_DEVICE zone for it
  * MEMORY_DEVICE_ALLOW_MIGRATE: pages in that device memory can be migrated
+ * MEMORY_DEVICE_UNADDRESSABLE: un-addressable memory (CPU can not access it)
  */
 #define MEMORY_NORMAL 0
 #define MEMORY_DEVICE (1 << 0)
 #define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
+#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)
 
 extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 7845f2e..a646c47 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -35,31 +35,42 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
 }
 #endif
 
+typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
+				unsigned long addr,
+				struct page *page,
+				unsigned flags,
+				pmd_t *pmdp);
 typedef void (*dev_page_free_t)(struct page *page, void *data);
 
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
+ * @page_fault: callback when the CPU faults on an un-addressable device page
  * @page_free: free page callback when the page refcount reaches 1
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
  * @res: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @dev: host device of the mapping for debug
  * @data: private data pointer for page_free
+ * @flags: device memory flags (see MEMORY_DEVICE_* in memory_hotplug.h)
  */
 struct dev_pagemap {
+	dev_page_fault_t page_fault;
 	dev_page_free_t page_free;
 	struct vmem_altmap *altmap;
 	const struct resource *res;
 	struct percpu_ref *ref;
 	struct device *dev;
 	void *data;
+	int flags;
 };
 
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap,
+			  dev_page_fault_t page_fault,
 			  dev_page_free_t page_free,
-			  void *data);
+			  void *data, int flags);
 struct dev_pagemap *find_dev_pagemap(resource_size_t phys);
 int devm_memremap_pages_remove(struct device *dev, struct dev_pagemap *pgmap);
 
@@ -68,13 +79,22 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 	return ((page_zonenum(page) == ZONE_DEVICE) &&
 		(page->pgmap->flags & MEMORY_DEVICE_ALLOW_MIGRATE));
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return ((page_zonenum(page) != ZONE_DEVICE) ||
+		!(page->pgmap->flags & MEMORY_DEVICE_UNADDRESSABLE));
+}
 #else
 static inline void *devm_memremap_pages(struct device *dev,
 					struct resource *res,
 					struct percpu_ref *ref,
 					struct vmem_altmap *altmap,
+					struct dev_pagemap **ppgmap,
+					dev_page_fault_t page_fault,
 					dev_page_free_t page_free,
-					void *data)
+					void *data,
+					int flags)
 {
 	/*
 	 * Fail attempts to call devm_memremap_pages() without
@@ -100,6 +120,11 @@ static inline bool dev_page_allow_migrate(const struct page *page)
 {
 	return false;
 }
+
+static inline bool is_addressable_page(const struct page *page)
+{
+	return true;
+}
 #endif
 
 /**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7e553e1..599cb54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -50,6 +50,17 @@ static inline int current_is_kswapd(void)
  */
 
 /*
+ * Un-addressable device memory support
+ */
+#ifdef CONFIG_DEVICE_UNADDRESSABLE
+#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_WRITE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+#define SWP_DEVICE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM + 1)
+#else
+#define SWP_DEVICE_NUM 0
+#endif
+
+/*
  * NUMA node memory migration support
  */
 #ifdef CONFIG_MIGRATION
@@ -71,7 +82,8 @@ static inline int current_is_kswapd(void)
 #endif
 
 #define MAX_SWAPFILES \
-	((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
@@ -442,8 +454,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(swp)	is_migration_entry(swp)
-#define swapcache_prepare(swp)		is_migration_entry(swp)
+#define free_swap_and_cache(e) (is_migration_entry(e) || is_device_entry(e))
+#define swapcache_prepare(e) (is_migration_entry(e) || is_device_entry(e))
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5c3a5f3..0e339f0 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,6 +100,73 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
 }
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(write?SWP_DEVICE_WRITE:SWP_DEVICE, page_to_pfn(page));
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	int type = swp_type(entry);
+	return type == SWP_DEVICE || type == SWP_DEVICE_WRITE;
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+	*entry = swp_entry(SWP_DEVICE, swp_offset(*entry));
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return pfn_to_page(swp_offset(entry));
+}
+
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp);
+#else /* CONFIG_DEVICE_UNADDRESSABLE */
+static inline swp_entry_t make_device_entry(struct page *page, bool write)
+{
+	return swp_entry(0, 0);
+}
+
+static inline void make_device_entry_read(swp_entry_t *entry)
+{
+}
+
+static inline bool is_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_write_device_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline struct page *device_entry_to_page(swp_entry_t entry)
+{
+	return NULL;
+}
+
+static inline int device_entry_fault(struct vm_area_struct *vma,
+				     unsigned long addr,
+				     swp_entry_t entry,
+				     unsigned flags,
+				     pmd_t *pmdp)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 #ifdef CONFIG_MIGRATION
 static inline swp_entry_t make_migration_entry(struct page *page, int write)
 {
diff --git a/kernel/memremap.c b/kernel/memremap.c
index bc1e400..3df08f4 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -18,6 +18,8 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 #ifndef ioremap_cache
 /* temporary while we convert existing ioremap_cache users to memremap */
@@ -200,6 +202,21 @@ void put_zone_device_page(struct page *page)
 }
 EXPORT_SYMBOL(put_zone_device_page);
 
+#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
+int device_entry_fault(struct vm_area_struct *vma,
+		       unsigned long addr,
+		       swp_entry_t entry,
+		       unsigned flags,
+		       pmd_t *pmdp)
+{
+	struct page *page = device_entry_to_page(entry);
+
+	BUG_ON(!page->pgmap->page_fault);
+	return page->pgmap->page_fault(vma, addr, page, flags, pmdp);
+}
+EXPORT_SYMBOL(device_entry_fault);
+#endif /* CONFIG_DEVICE_UNADDRESSABLE */
+
 static void pgmap_radix_release(struct resource *res)
 {
 	resource_size_t key, align_start, align_size, align_end;
@@ -252,7 +269,7 @@ static void devm_memremap_pages_release(struct device *dev, void *data)
 	/* pages are dead and unused, undo the arch mapping */
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(resource_size(res), SECTION_SIZE);
-	arch_remove_memory(align_start, align_size, MEMORY_DEVICE);
+	arch_remove_memory(align_start, align_size, pgmap->flags);
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
 	pgmap_radix_release(res);
 	dev_WARN_ONCE(dev, pgmap->altmap && pgmap->altmap->alloc,
@@ -276,8 +293,11 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  * @res: "host memory" address range
  * @ref: a live per-cpu reference count
  * @altmap: optional descriptor for allocating the memmap from @res
+ * @ppgmap: pointer set to the new dev_pagemap on success
+ * @page_fault: callback for CPU page faults on un-addressable memory
  * @page_free: callback called when the page refcount reaches 1 (it is free)
  * @data: private data pointer for page_free
+ * @flags: device memory flags (see MEMORY_DEVICE_* in memory_hotplug.h)
  *
  * Notes:
  * 1/ @ref must be 'live' on entry and 'dead' before devm_memunmap_pages() time
@@ -289,8 +309,10 @@ struct dev_pagemap *find_dev_pagemap(resource_size_t phys)
  */
 void *devm_memremap_pages(struct device *dev, struct resource *res,
 			  struct percpu_ref *ref, struct vmem_altmap *altmap,
+			  struct dev_pagemap **ppgmap,
+			  dev_page_fault_t page_fault,
 			  dev_page_free_t page_free,
-			  void *data)
+			  void *data, int flags)
 {
 	resource_size_t key, align_start, align_size, align_end;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -299,6 +321,17 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	int error, nid, is_ram;
 	unsigned long pfn;
 
+	if (!(flags & MEMORY_DEVICE)) {
+		WARN_ONCE(1, "%s attempted on non device memory\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (altmap && (flags & MEMORY_DEVICE_UNADDRESSABLE)) {
+		WARN_ONCE(1, "%s with altmap for un-addressable "
+			  "device memory\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
 	align_start = res->start & ~(SECTION_SIZE - 1);
 	align_size = ALIGN(res->start + resource_size(res), SECTION_SIZE)
 		- align_start;
@@ -332,8 +365,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	}
 	pgmap->ref = ref;
 	pgmap->res = &page_map->res;
+	pgmap->page_fault = page_fault;
 	pgmap->page_free = page_free;
 	pgmap->data = data;
+	pgmap->flags = flags;
 
 	mutex_lock(&pgmap_lock);
 	error = 0;
@@ -370,7 +405,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 	if (error)
 		goto err_pfn_remap;
 
-	error = arch_add_memory(nid, align_start, align_size, MEMORY_DEVICE);
+	error = arch_add_memory(nid, align_start, align_size, pgmap->flags);
 	if (error)
 		goto err_add_memory;
 
@@ -387,6 +422,8 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
 		page->pgmap = pgmap;
 	}
 	devres_add(dev, page_map);
+	if (ppgmap)
+		*ppgmap = pgmap;
 	return __va(res->start);
 
  err_add_memory:
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..8564a5f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,18 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config DEVICE_UNADDRESSABLE
+	bool "Un-addressable device memory (GPU memory, ...)"
+	depends on ZONE_DEVICE
+
+	help
+	  Allow creating struct page for un-addressable device memory,
+	  i.e. memory that is only accessible by the device (or group of
+	  devices).
+
+	  Having struct page is necessary for process memory migration
+	  to device memory.
+
 config FRAME_VECTOR
 	bool
 
diff --git a/mm/memory.c b/mm/memory.c
index 840adc6..03306cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/memremap.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -888,6 +889,25 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 					pte = pte_swp_mksoft_dirty(pte);
 				set_pte_at(src_mm, addr, src_pte, pte);
 			}
+		} else if (is_device_entry(entry)) {
+			page = device_entry_to_page(entry);
+
+			/*
+			 * Update rss count even for un-addressable pages, as
+			 * they should be considered just like any other page.
+			 */
+			get_page(page);
+			rss[mm_counter(page)]++;
+			page_dup_rmap(page, false);
+
+			if (is_write_device_entry(entry) &&
+			    is_cow_mapping(vm_flags)) {
+				make_device_entry_read(&entry);
+				pte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(*src_pte))
+					pte = pte_swp_mksoft_dirty(pte);
+				set_pte_at(src_mm, addr, src_pte, pte);
+			}
 		}
 		goto out_set_pte;
 	}
@@ -1178,6 +1198,32 @@ again:
 			}
 			continue;
 		}
+
+		/*
+		 * Un-addressable pages are not like other swap entries and thus
+		 * must always be checked, no matter what the
+		 * details->check_swap_entries value is.
+		 */
+		entry = pte_to_swp_entry(ptent);
+		if (non_swap_entry(entry) && is_device_entry(entry)) {
+			struct page *page = device_entry_to_page(entry);
+
+			if (unlikely(details)) {
+				/*
+				 * unmap_shared_mapping_pages() wants to
+				 * invalidate cache without truncating:
+				 * unmap shared but keep private pages.
+				 */
+				if (details->check_mapping &&
+				    details->check_mapping != page_rmapping(page))
+					continue;
+			}
+
+			rss[mm_counter(page)]--;
+			page_remove_rmap(page, false);
+			put_page(page);
+		}
+
 		/* only check swap_entries if explicitly asked for in details */
 		if (unlikely(details && !details->check_swap_entries))
 			continue;
@@ -2535,6 +2581,14 @@ int do_swap_page(struct fault_env *fe, pte_t orig_pte)
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
+		} else if (is_device_entry(entry)) {
+			/*
+			 * For un-addressable device memory we call the pgmap
+			 * fault handler callback. The callback must migrate
+			 * the page back to some CPU accessible page.
+			 */
+			ret = device_entry_fault(vma, fe->address, entry,
+						 fe->flags, fe->pmd);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
@@ -3482,6 +3536,7 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 static int handle_pte_fault(struct fault_env *fe)
 {
 	pte_t entry;
+	struct page *page;
 
 	if (unlikely(pmd_none(*fe->pmd))) {
 		/*
@@ -3533,6 +3588,13 @@ static int handle_pte_fault(struct fault_env *fe)
 	if (pte_protnone(entry) && vma_is_accessible(fe->vma))
 		return do_numa_page(fe, entry);
 
+	/* Catch mapping of un-addressable memory, this should never happen */
+	page = pfn_to_page(pte_pfn(entry));
+	if (!is_addressable_page(page)) {
+		print_bad_pte(fe->vma, fe->address, entry, page);
+		return VM_FAULT_SIGBUS;
+	}
+
 	fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
 	spin_lock(fe->ptl);
 	if (unlikely(!pte_same(*fe->pte, entry)))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1bc1eb3..70aff3a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,6 +139,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 
 				pages++;
 			}
+
+			if (is_write_device_entry(entry)) {
+				pte_t newpte;
+
+				make_device_entry_read(&entry);
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				set_pte_at(mm, addr, pte, newpte);
+
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index 6505a87..0c8696c 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -108,7 +108,8 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct resource *res,
 
 	if (nfit_res)
 		return nfit_res->buf + offset - nfit_res->res->start;
-	return devm_memremap_pages(dev, res, ref, altmap, NULL, NULL);
+	return devm_memremap_pages(dev, res, ref, altmap, NULL,
+				   NULL, NULL, NULL, MEMORY_DEVICE);
 }
 EXPORT_SYMBOL(__wrap_devm_memremap_pages);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 06/16] mm/ZONE_DEVICE/x86: add support for un-addressable device memory
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (4 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin

It does not need much: just skip populating the kernel linear mapping
for the range of un-addressable device memory (the range is picked so
that no physical memory resource overlaps it). All the logic is in
shared mm code.

Only x86-64 is supported, as this feature doesn't make much sense with
the constrained virtual address space of 32-bit architectures.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
---
 arch/x86/mm/init_64.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 442ac86..6e7f613 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -654,7 +654,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 int arch_add_memory(int nid, u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	struct zone *zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL,
@@ -669,7 +670,17 @@ int arch_add_memory(int nid, u64 start, u64 size, int flags)
 		return -EINVAL;
 	}
 
-	init_memory_mapping(start, start + size);
+	/*
+	 * We get un-addressable memory when someone is adding a ZONE_DEVICE
+	 * range to have struct page for device memory which is not accessible
+	 * by the CPU, so it is pointless to have a linear kernel mapping of
+	 * such memory.
+	 *
+	 * Core mm should make sure it never sets a pte pointing to such a fake
+	 * physical range.
+	 */
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		init_memory_mapping(start, start + size);
 
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
@@ -968,7 +979,8 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 int __ref arch_remove_memory(u64 start, u64 size, int flags)
 {
 	const int supported_flags = MEMORY_DEVICE |
-				    MEMORY_DEVICE_ALLOW_MIGRATE;
+				    MEMORY_DEVICE_ALLOW_MIGRATE |
+				    MEMORY_DEVICE_UNADDRESSABLE;
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	struct page *page = pfn_to_page(start_pfn);
@@ -989,7 +1001,9 @@ int __ref arch_remove_memory(u64 start, u64 size, int flags)
 	zone = page_zone(page);
 	ret = __remove_pages(zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
-	kernel_physical_mapping_remove(start, start + size);
+
+	if (!(flags & MEMORY_DEVICE_UNADDRESSABLE))
+		kernel_physical_mapping_remove(start, start + size);
 
 	return ret;
 }
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 07/16] mm/hmm: heterogeneous memory management (HMM for short)
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (5 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 06/16] mm/ZONE_DEVICE/x86: " Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

HMM provides 3 separate functionalities:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

This patch introduces some common helpers and definitions shared by
all 3 of these features.
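
For illustration, a minimal sketch of how a driver might use the hmm_pfn_t
helpers introduced by this patch (device_map_one() is a made-up driver
function, everything else comes from include/linux/hmm.h):

    /* Sketch only: encode a struct page, then decode it on the device side. */
    static void driver_mirror_one_page(struct page *page, bool write)
    {
        hmm_pfn_t hpfn = hmm_pfn_from_page(page);

        if (write)
            hpfn |= HMM_PFN_WRITE;

        /* hmm_pfn_to_page() returns NULL unless HMM_PFN_VALID is set. */
        if (hmm_pfn_to_page(hpfn) == page)
            device_map_one(hmm_pfn_to_pfn(hpfn), !!(hpfn & HMM_PFN_WRITE));
    }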

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 MAINTAINERS              |   7 +++
 include/linux/hmm.h      | 150 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h |   5 ++
 kernel/fork.c            |   2 +
 mm/Kconfig               |   4 ++
 mm/Makefile              |   1 +
 mm/hmm.c                 |  82 ++++++++++++++++++++++++++
 7 files changed, 251 insertions(+)
 create mode 100644 include/linux/hmm.h
 create mode 100644 mm/hmm.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f593300..41cd63d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5582,6 +5582,13 @@ S:	Supported
 F:	drivers/scsi/hisi_sas/
 F:	Documentation/devicetree/bindings/scsi/hisilicon-sas.txt
 
+HMM - Heterogeneous Memory Management
+M:	Jérôme Glisse <jglisse@redhat.com>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/hmm*
+F:	include/linux/hmm*
+
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
 L:	hostap@shmoo.com (subscribers-only)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
new file mode 100644
index 0000000..f00d519
--- /dev/null
+++ b/include/linux/hmm.h
@@ -0,0 +1,150 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * HMM provides 3 separate functionalities:
+ *   - Mirroring: synchronize CPU page table and device page table
+ *   - Device memory: allocating struct page for device memory
+ *   - Migration: migrating regular memory to device memory
+ *
+ * Each can be used independently from the others.
+ *
+ *
+ * Mirroring:
+ *
+ * HMM provides helpers to mirror a process address space on a device. For this
+ * it provides several helpers to order device page table updates with respect
+ * to CPU page table updates. The requirement is that for any given virtual
+ * address the CPU and device page tables can not point to different physical
+ * pages. It uses the mmu_notifier API behind the scenes.
+ *
+ * Device memory:
+ *
+ * HMM provides helpers to leverage device memory, either addressable by the
+ * CPU like regular memory or not addressable at all. In both cases the device
+ * memory is associated with dedicated struct pages (which are allocated as for
+ * hotplugged memory). Device memory management is the responsibility of the
+ * device driver. HMM only allocates and initializes the struct pages associated
+ * with the device memory by hotplugging a ZONE_DEVICE memory range.
+ *
+ * Allocating struct pages for device memory allows device memory to be used
+ * almost like regular memory. Unlike regular memory it can not be added to the
+ * lru, nor can any memory allocation use device memory directly. Device
+ * memory will only end up being used in a process if the device driver migrates
+ * some of the process memory from regular memory to device memory.
+ *
+ *
+ * Migration:
+ *
+ * The existing memory migration mechanism (mm/migrate.c) does not allow using
+ * anything other than the CPU to copy from source to destination memory.
+ * Moreover the existing code is not tailored to drive migration from a process
+ * virtual address rather than from a list of pages. Finally the migration flow
+ * does not allow for graceful failure at different steps of the migration
+ * process.
+ *
+ * HMM solves all of the above through a simple API:
+ *
+ *      hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private);
+ *
+ * The ops struct provides 2 callbacks: alloc_and_copy(), which allocates the
+ * destination memory and initializes it using the source memory, and, since
+ * migration can fail after this step, finalize_and_map(), which lets the device
+ * driver know which pages were successfully migrated and which were not.
+ *
+ * This can easily be used outside of HMM's intended use case.
+ *
+ *
+ * This header file contains all the APIs related to these 3 functionalities,
+ * and each function and struct is more thoroughly documented in the comments below.
+ */
+#ifndef LINUX_HMM_H
+#define LINUX_HMM_H
+
+#include <linux/kconfig.h>
+
+#if IS_ENABLED(CONFIG_HMM)
+
+
+/*
+ * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
+ *
+ * Flags:
+ * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_WRITE: CPU page table has the write permission set
+ */
+typedef unsigned long hmm_pfn_t;
+
+#define HMM_PFN_VALID (1 << 0)
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_SHIFT 2
+
+/*
+ * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
+ * @pfn: hmm_pfn_t to convert to struct page
+ * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ *
+ * If the hmm_pfn_t is valid (i.e. the valid flag is set) then return the struct
+ * page matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ */
+static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return NULL;
+	return pfn_to_page(pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_to_pfn() - return pfn value stored in a hmm_pfn_t
+ * @pfn: hmm_pfn_t to extract pfn from
+ * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ */
+static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
+{
+	if (!(pfn & HMM_PFN_VALID))
+		return -1UL;
+	return (pfn >> HMM_PFN_SHIFT);
+}
+
+/*
+ * hmm_pfn_from_page() - create a valid hmm_pfn_t value from struct page
+ * @page: struct page pointer for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the page
+ */
+static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
+{
+	return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+/*
+ * hmm_pfn_from_pfn() - create a valid hmm_pfn_t value from pfn
+ * @pfn: pfn value for which to create the hmm_pfn_t
+ * Returns: valid hmm_pfn_t for the pfn
+ */
+static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
+{
+	return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
+}
+
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+void hmm_mm_destroy(struct mm_struct *mm);
+
+#else /* IS_ENABLED(CONFIG_HMM) */
+
+/* Below are for HMM internal use only ! Not to be used by device driver ! */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+
+#endif /* IS_ENABLED(CONFIG_HMM) */
+#endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a8aced..4effdbf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -23,6 +23,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct hmm;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -516,6 +517,10 @@ struct mm_struct {
 	atomic_long_t hugetlb_usage;
 #endif
 	struct work_struct async_put_work;
+#if IS_ENABLED(CONFIG_HMM)
+	/* HMM need to track few things per mm */
+	struct hmm *hmm;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 690a1aad..af0eec8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -27,6 +27,7 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
+#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
@@ -702,6 +703,7 @@ void __mmdrop(struct mm_struct *mm)
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	free_mm(mm);
diff --git a/mm/Kconfig b/mm/Kconfig
index 8564a5f..2f6a69f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -289,6 +289,10 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	bool
 
+config HMM
+	bool
+	depends on MMU
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/Makefile b/mm/Makefile
index 2ca1faf..6ac1284 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,6 +76,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_HMM) += hmm.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/hmm.c b/mm/hmm.c
new file mode 100644
index 0000000..e891fdd
--- /dev/null
+++ b/mm/hmm.c
@@ -0,0 +1,82 @@
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * Refer to include/linux/hmm.h for information about heterogeneous memory
+ * management or HMM for short.
+ */
+#include <linux/mm.h>
+#include <linux/hmm.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+/*
+ * struct hmm - HMM per mm struct
+ *
+ * @mm: mm struct this HMM struct is bound to
+ */
+struct hmm {
+	struct mm_struct	*mm;
+};
+
+/*
+ * hmm_register - register HMM against an mm (HMM internal)
+ *
+ * @mm: mm struct to attach to
+ *
+ * This is not intended to be used directly by device drivers but by other HMM
+ * components. It allocates an HMM struct if the mm does not have one and
+ * initializes it.
+ */
+static struct hmm *hmm_register(struct mm_struct *mm)
+{
+	if (!mm->hmm) {
+		struct hmm *hmm = NULL;
+
+		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
+		if (!hmm)
+			return NULL;
+		hmm->mm = mm;
+
+		spin_lock(&mm->page_table_lock);
+		if (!mm->hmm)
+			mm->hmm = hmm;
+		else
+			kfree(hmm);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	/*
+	 * The hmm struct can only be freed once the mm_struct goes away,
+	 * hence we should always have pre-allocated a new hmm struct
+	 * above.
+	 */
+	return mm->hmm;
+}
+
+void hmm_mm_destroy(struct mm_struct *mm)
+{
+	struct hmm *hmm;
+
+	/*
+	 * We should not need to lock here as no one should be able to register
+	 * a new HMM while an mm is being destroyed. But just to be safe ...
+	 */
+	spin_lock(&mm->page_table_lock);
+	hmm = mm->hmm;
+	mm->hmm = NULL;
+	spin_unlock(&mm->page_table_lock);
+	kfree(hmm);
+}
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (6 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This adds heterogeneous memory management (HMM) process address space
mirroring. In a nutshell it provides an API to mirror a process address
space on a device. This boils down to keeping the CPU and device page
tables synchronized (we assume that both device and CPU are cache
coherent, as PCIe devices can be).

This patch provides a simple API for device drivers to achieve address
space mirroring, thus avoiding the need for each device driver to grow
its own CPU page table walker and its own CPU page table synchronization
mechanism.

This is useful for NVidia GPU >= Pascal, Mellanox IB >= mlx5 and more
hardware in the future.
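
For illustration, the registration pattern this API expects boils down to
something like the following sketch (my_update(), my_mirror, my_bind() and
my_unbind() are made-up driver-side pieces):

    static void my_update(struct hmm_mirror *mirror, enum hmm_update action,
                          unsigned long start, unsigned long end)
    {
        /* Invalidate the device page table for [start, end) before returning. */
    }

    static const struct hmm_mirror_ops my_mirror_ops = {
        .update = my_update,
    };

    static struct hmm_mirror my_mirror = {
        .ops = &my_mirror_ops,
    };

    /* Bind when the device attaches to the address space, unbind on release. */
    int my_bind(struct mm_struct *mm)
    {
        return hmm_mirror_register(&my_mirror, mm);
    }

    void my_unbind(void)
    {
        hmm_mirror_unregister(&my_mirror);
    }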

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 101 ++++++++++++++++++++++++++++
 mm/Kconfig          |  15 +++++
 mm/hmm.c            | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 301 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f00d519..31e2c50 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,7 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+struct hmm;
 
 /*
  * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
@@ -138,6 +139,106 @@ static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
 }
 
 
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+/*
+ * Mirroring: how to synchronize the device page table with the CPU page table?
+ *
+ * Device drivers must always synchronize with CPU page table updates; for this
+ * they can either directly use the mmu_notifier API or use the hmm_mirror API.
+ * A device driver can decide to register one mirror per device per process
+ * or just one mirror per process for a group of devices. The pattern is:
+ *
+ *      int device_bind_address_space(..., struct mm_struct *mm, ...)
+ *      {
+ *          struct device_address_space *das;
+ *          int ret;
+ *          // Device driver specific initialization, and allocation of das
+ *          // which contain an hmm_mirror struct as one of its field.
+ *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
+ *          if (ret) {
+ *              // Cleanup on error
+ *              return ret;
+ *          }
+ *          // Other device driver specific initialization
+ *      }
+ *
+ * The device driver must not free the struct containing the hmm_mirror struct
+ * before calling hmm_mirror_unregister(); the expected usage is to do that when
+ * the device driver is unbinding from an address space.
+ *
+ *      void device_unbind_address_space(struct device_address_space *das)
+ *      {
+ *          // Device driver specific cleanup
+ *          hmm_mirror_unregister(&das->mirror);
+ *          // Other device driver specific cleanup and now das can be free
+ *      }
+ *
+ * Once an hmm_mirror is registered for an address space, the device driver
+ * will get callbacks through the update() operation (see hmm_mirror_ops struct).
+ */
+
+struct hmm_mirror;
+
+/*
+ * enum hmm_update - type of update
+ * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
+ */
+enum hmm_update {
+	HMM_UPDATE_INVALIDATE,
+};
+
+/*
+ * struct hmm_mirror_ops - HMM mirror device operations callback
+ *
+ * @update: callback to update range on a device
+ */
+struct hmm_mirror_ops {
+	/* update() - update virtual address range of memory
+	 *
+	 * @mirror: pointer to struct hmm_mirror
+	 * @update: update's type (turn read only, unmap, ...)
+	 * @start: virtual start address of the range to update
+	 * @end: virtual end address of the range to update
+	 *
+	 * This callback is called when the CPU page table is updated, the device
+	 * driver must update the device page table according to the update action.
+	 *
+	 * The device driver callback must wait until the device has fully updated
+	 * its view of the range. Note we plan to make this asynchronous in later
+	 * patches, so that multiple devices can schedule updates to their page
+	 * tables and, once all devices have scheduled the update, we wait for
+	 * them to propagate.
+	 */
+	void (*update)(struct hmm_mirror *mirror,
+		       enum hmm_update action,
+		       unsigned long start,
+		       unsigned long end);
+};
+
+/*
+ * struct hmm_mirror - mirror struct for a device driver
+ *
+ * @hmm: pointer to struct hmm (which is unique per mm_struct)
+ * @ops: device driver callback for HMM mirror operations
+ * @list: for list of mirrors of a given mm
+ *
+ * Each address space (mm_struct) being mirrored by a device must register one
+ * hmm_mirror struct with HMM. HMM will track the list of all mirrors for each
+ * mm_struct (or each process).
+ */
+struct hmm_mirror {
+	struct hmm			*hmm;
+	const struct hmm_mirror_ops	*ops;
+	struct list_head		list;
+};
+
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
+int hmm_mirror_register_locked(struct hmm_mirror *mirror,
+			       struct mm_struct *mm);
+void hmm_mirror_unregister(struct hmm_mirror *mirror);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 2f6a69f..7dd4ca3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -293,6 +293,21 @@ config HMM
 	bool
 	depends on MMU
 
+config HMM_MIRROR
+	bool "HMM mirror CPU page table into a device page table"
+	select HMM
+	select MMU_NOTIFIER
+	help
+	  HMM mirror is a set of helpers to mirror a CPU page table into a device
+	  page table. There are two sides: the first is keeping both page tables
+	  synchronized so that no virtual address can point to different pages
+	  (though one page table might lag, i.e. one might still point to a page
+	  while the other is pointing to nothing).
+
+	  The second side of the equation is replicating CPU page table content
+	  for a range of virtual addresses. This requires careful synchronization
+	  with CPU page table updates.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index e891fdd..b725c6d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -21,14 +21,27 @@
 #include <linux/hmm.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 
 /*
  * struct hmm - HMM per mm struct
  *
  * @mm: mm struct this HMM struct is bound to
+ * @lock: lock protecting mirrors list
+ * @mirrors: list of mirrors for this mm
+ * @wait_queue: wait queue
+ * @sequence: we track update to CPU page table with a sequence number
+ * @mmu_notifier: mmu notifier to track update to CPU page table
+ * @notifier_count: number of currently active notifiers
  */
 struct hmm {
 	struct mm_struct	*mm;
+	spinlock_t		lock;
+	struct list_head	mirrors;
+	atomic_t		sequence;
+	wait_queue_head_t	wait_queue;
+	struct mmu_notifier	mmu_notifier;
+	atomic_t		notifier_count;
 };
 
 /*
@@ -48,6 +61,12 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 		if (!hmm)
 			return NULL;
+		init_waitqueue_head(&hmm->wait_queue);
+		atomic_set(&hmm->notifier_count, 0);
+		INIT_LIST_HEAD(&hmm->mirrors);
+		atomic_set(&hmm->sequence, 0);
+		hmm->mmu_notifier.ops = NULL;
+		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
 		spin_lock(&mm->page_table_lock);
@@ -80,3 +99,169 @@ void hmm_mm_destroy(struct mm_struct *mm)
 	spin_unlock(&mm->page_table_lock);
 	kfree(hmm);
 }
+
+
+#if IS_ENABLED(CONFIG_HMM_MIRROR)
+static void hmm_invalidate_range(struct hmm *hmm,
+				 enum hmm_update action,
+				 unsigned long start,
+				 unsigned long end)
+{
+	struct hmm_mirror *mirror;
+
+	/*
+	 * A mirror being added or removed is a rare event so list traversal isn't
+	 * protected by a lock; we rely on simple rules. All list modifications
+	 * are done using list_add_rcu() and list_del_rcu() under a spinlock to
+	 * protect from concurrent addition or removal but not traversal.
+	 *
+	 * Because hmm_mirror_unregister() waits for all running invalidations to
+	 * complete (and thus for all list traversals to finish), none of the
+	 * mirror structs can be freed from under us while traversing the list and
+	 * thus it is safe to dereference their list pointers even if they were
+	 * just removed.
+	 */
+	list_for_each_entry (mirror, &hmm->mirrors, list)
+		mirror->ops->update(mirror, action, start, end);
+}
+
+static void hmm_invalidate_page(struct mmu_notifier *mn,
+				struct mm_struct *mm,
+				unsigned long addr)
+{
+	unsigned long start = addr & PAGE_MASK;
+	unsigned long end = start + PAGE_SIZE;
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long start,
+				       unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	atomic_inc(&hmm->notifier_count);
+	atomic_inc(&hmm->sequence);
+	hmm_invalidate_range(mm->hmm, HMM_UPDATE_INVALIDATE, start, end);
+}
+
+static void hmm_invalidate_range_end(struct mmu_notifier *mn,
+				     struct mm_struct *mm,
+				     unsigned long start,
+				     unsigned long end)
+{
+	struct hmm *hmm = mm->hmm;
+
+	VM_BUG_ON(!hmm);
+
+	/* Reverse order here because we are getting out of invalidation */
+	atomic_dec(&hmm->notifier_count);
+	wake_up(&hmm->wait_queue);
+}
+
+static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+	.invalidate_page	= hmm_invalidate_page,
+	.invalidate_range_start	= hmm_invalidate_range_start,
+	.invalidate_range_end	= hmm_invalidate_range_end,
+};
+
+static int hmm_mirror_do_register(struct hmm_mirror *mirror,
+				  struct mm_struct *mm,
+				  const bool locked)
+{
+	/* Sanity check */
+	if (!mm || !mirror || !mirror->ops)
+		return -EINVAL;
+
+	mirror->hmm = hmm_register(mm);
+	if (!mirror->hmm)
+		return -ENOMEM;
+
+	/* Register mmu_notifier if not already, use mmap_sem for locking */
+	if (!mirror->hmm->mmu_notifier.ops) {
+		struct hmm *hmm = mirror->hmm;
+
+		if (!locked)
+			down_write(&mm->mmap_sem);
+		if (!hmm->mmu_notifier.ops) {
+			hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
+			if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+				hmm->mmu_notifier.ops = NULL;
+				up_write(&mm->mmap_sem);
+				return -ENOMEM;
+			}
+		}
+		if (!locked)
+			up_write(&mm->mmap_sem);
+	}
+
+	spin_lock(&mirror->hmm->lock);
+	list_add_rcu(&mirror->list, &mirror->hmm->mirrors);
+	spin_unlock(&mirror->hmm->lock);
+
+	return 0;
+}
+
+/*
+ * hmm_mirror_register() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * To start mirroring a process address space the device driver must register
+ * an HMM mirror struct.
+ */
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, false);
+}
+EXPORT_SYMBOL(hmm_mirror_register);
+
+/*
+ * hmm_mirror_register_locked() - register a mirror against an mm
+ *
+ * @mirror: new mirror struct to register
+ * @mm: mm to register against
+ *
+ * Same as hmm_mirror_register() except that the mmap_sem must be write locked !
+ */
+int hmm_mirror_register_locked(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return hmm_mirror_do_register(mirror, mm, true);
+}
+EXPORT_SYMBOL(hmm_mirror_register_locked);
+
+/*
+ * hmm_mirror_unregister() - unregister a mirror
+ *
+ * @mirror: mirror struct to unregister
+ *
+ * Stop mirroring a process address space and cleanup.
+ */
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+	struct hmm *hmm = mirror->hmm;
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&mirror->list);
+	spin_unlock(&hmm->lock);
+
+	/*
+	 * Wait for all active notifiers so that it is safe to traverse the
+	 * mirror list without any lock.
+	 */
+	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
+}
+EXPORT_SYMBOL(hmm_mirror_unregister);
+#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 09/16] mm/hmm/mirror: helper to snapshot CPU page table
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (7 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This does not use the existing page table walker because we want to
share the same code with our page fault handler.
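
To make the intended flow concrete, here is a condensed sketch of the
snapshot-then-validate loop (the my_device_*() helpers stand in for driver
specific code and are made up):

    static void my_mirror_range(struct vm_area_struct *vma,
                                unsigned long start, unsigned long end,
                                hmm_pfn_t *pfns)
    {
        struct hmm_range range;

    again:
        if (hmm_vma_get_pfns(vma, &range, start, end, pfns))
            return;                          /* invalid vma or out of memory */
        my_device_prepare_update(pfns);      /* build device page table update */
        my_device_page_table_lock();
        if (!hmm_vma_range_done(vma, &range)) {
            /* CPU page table changed behind our back, snapshot is stale. */
            my_device_page_table_unlock();
            goto again;
        }
        my_device_commit_update();
        my_device_page_table_unlock();
    }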

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  56 +++++++++++-
 mm/hmm.c            | 257 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 311 insertions(+), 2 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 31e2c50..b5eafdc 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -83,13 +83,28 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
+ * HMM_PFN_READ: read permission set
  * HMM_PFN_WRITE: CPU page table has the write permission set
+ * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
+ * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
+ * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_SPECIAL: corresponding CPU page table entry is special, i.e. the
+ *      result of vm_insert_pfn() or vm_insert_page() and thus should not be
+ *      mirrored by a device (the entry will never have HMM_PFN_VALID set and
+ *      the pfn value is undefined)
+ * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_WRITE (1 << 1)
-#define HMM_PFN_SHIFT 2
+#define HMM_PFN_READ (1 << 1)
+#define HMM_PFN_WRITE (1 << 2)
+#define HMM_PFN_ERROR (1 << 3)
+#define HMM_PFN_EMPTY (1 << 4)
+#define HMM_PFN_DEVICE (1 << 5)
+#define HMM_PFN_SPECIAL (1 << 6)
+#define HMM_PFN_UNADDRESSABLE (1 << 7)
+#define HMM_PFN_SHIFT 8
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -236,6 +251,43 @@ int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 int hmm_mirror_register_locked(struct hmm_mirror *mirror,
 			       struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
+
+
+/*
+ * struct hmm_range - track invalidation lock on virtual address range
+ *
+ * @list: all range locks are on a list
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of pfns (big enough for the range)
+ * @valid: pfns array did not change since it has been filled by an HMM function
+ */
+struct hmm_range {
+	struct list_head	list;
+	unsigned long		start;
+	unsigned long		end;
+	hmm_pfn_t		*pfns;
+	bool			valid;
+};
+
+/*
+ * To snapshot the CPU page table call hmm_vma_get_pfns(), then take the device
+ * driver lock that serializes device page table updates and call
+ * hmm_vma_range_done() to check if the snapshot is still valid. The device
+ * driver page table update lock must also be used in the HMM mirror update()
+ * callback so that CPU page table invalidation serializes on it.
+ *
+ * YOU MUST CALL hmm_vma_range_done() ONCE AND ONLY ONCE EACH TIME YOU CALL
+ * hmm_vma_get_pfns() WITHOUT ERROR !
+ *
+ * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns);
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index b725c6d..0ef06df 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -19,10 +19,15 @@
  */
 #include <linux/mm.h>
 #include <linux/hmm.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/swapops.h>
+#include <linux/hugetlb.h>
 #include <linux/mmu_notifier.h>
 
+
 /*
  * struct hmm - HMM per mm struct
  *
@@ -37,6 +42,7 @@
 struct hmm {
 	struct mm_struct	*mm;
 	spinlock_t		lock;
+	struct list_head	ranges;
 	struct list_head	mirrors;
 	atomic_t		sequence;
 	wait_queue_head_t	wait_queue;
@@ -66,6 +72,7 @@ static struct hmm *hmm_register(struct mm_struct *mm)
 		INIT_LIST_HEAD(&hmm->mirrors);
 		atomic_set(&hmm->sequence, 0);
 		hmm->mmu_notifier.ops = NULL;
+		INIT_LIST_HEAD(&hmm->ranges);
 		spin_lock_init(&hmm->lock);
 		hmm->mm = mm;
 
@@ -108,6 +115,22 @@ static void hmm_invalidate_range(struct hmm *hmm,
 				 unsigned long end)
 {
 	struct hmm_mirror *mirror;
+	struct hmm_range *range;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(range, &hmm->ranges, list) {
+		unsigned long addr, idx, npages;
+
+		if (end < range->start || start >= range->end)
+			continue;
+
+		range->valid = false;
+		addr = max(start, range->start);
+		idx = (addr - range->start) >> PAGE_SHIFT;
+		npages = (min(range->end, end) - addr) >> PAGE_SHIFT;
+		memset(&range->pfns[idx], 0, sizeof(*range->pfns) * npages);
+	}
+	rcu_read_unlock();
 
 	/*
 	 * A mirror being added or removed is a rare event so list traversal isn't
@@ -264,4 +287,238 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	wait_event(hmm->wait_queue, !atomic_read(&hmm->notifier_count));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
+
+static void hmm_pfns_empty(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_EMPTY;
+}
+
+static void hmm_pfns_special(hmm_pfn_t *pfns,
+			     unsigned long addr,
+			     unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_SPECIAL;
+}
+
+static void hmm_vma_walk(struct vm_area_struct *vma,
+			 unsigned long start,
+			 unsigned long end,
+			 hmm_pfn_t *pfns)
+{
+	unsigned long addr, next;
+	hmm_pfn_t flag;
+
+	flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
+
+	for (addr = start; addr < end; addr = next) {
+		unsigned long i = (addr - start) >> PAGE_SHIFT;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+		pmd_t pmd;
+
+		/*
+		 * We are accessing/faulting for a device from an unknown
+		 * thread that might be foreign to the mm we are faulting
+		 * against so do not call arch_vma_access_permitted() !
+		 */
+
+		next = pgd_addr_end(addr, end);
+		pgdp = pgd_offset(vma->vm_mm, addr);
+		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pud_addr_end(addr, end);
+		pudp = pud_offset(pgdp, addr);
+		if (pud_none(*pudp) || pud_bad(*pudp)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_offset(pudp, addr);
+		pmd = pmd_read_atomic(pmdp);
+		barrier();
+		if (pmd_none(pmd) || pmd_bad(pmd)) {
+			hmm_pfns_empty(&pfns[i], addr, next);
+			continue;
+		}
+		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
+			hmm_pfn_t flags = flag;
+
+			if (pmd_protnone(pmd)) {
+				hmm_pfns_clear(&pfns[i], addr, next);
+				continue;
+			}
+			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
+			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
+				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
+			continue;
+		}
+
+		ptep = pte_offset_map(pmdp, addr);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			pte_t pte = *ptep;
+
+			pfns[i] = 0;
+
+			if (pte_none(pte)) {
+				pfns[i] = HMM_PFN_EMPTY;
+				continue;
+			}
+
+			entry = pte_to_swp_entry(pte);
+			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				continue;
+			}
+
+			if (pte_present(pte)) {
+				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
+				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
+				continue;
+			}
+
+			/*
+			 * This is a special swap entry, ignore migration, use
+			 * device and report anything else as error.
+			*/
+			if (is_device_entry(entry)) {
+				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
+				if (is_write_device_entry(entry))
+					pfns[i] |= HMM_PFN_WRITE;
+				pfns[i] |= HMM_PFN_DEVICE;
+				pfns[i] |= HMM_PFN_UNADDRESSABLE;
+				pfns[i] |= flag;
+			} else if (!is_migration_entry(entry)) {
+				pfns[i] = HMM_PFN_ERROR;
+			}
+		}
+		pte_unmap(ptep - 1);
+	}
+}
+
+/*
+ * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track snapshot validity
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t provided by the caller, filled by the function
+ * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
+ *
+ * This snapshots the CPU page table for a range of virtual addresses; snapshot
+ * validity is tracked by the range struct, see hmm_vma_range_done() for further
+ * information.
+ *
+ * The range struct is initialized and tracks the CPU page table only if the
+ * function returns success (0); you must then call hmm_vma_range_done() to stop
+ * range CPU page table update tracking.
+ *
+ * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
+ * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_get_pfns(struct vm_area_struct *vma,
+		     struct hmm_range *range,
+		     unsigned long start,
+		     unsigned long end,
+		     hmm_pfn_t *pfns)
+{
+	struct hmm *hmm;
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return -EINVAL;
+	}
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm)
+		return -ENOMEM;
+	/* Caller must have register a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	hmm_vma_walk(vma, start, end, pfns);
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_get_pfns);
+
+/*
+ * hmm_vma_range_done() - stop tracking change to CPU page table over a range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: range being track
+ * Returns: false if range data have been invalidated, true otherwise
+ *
+ * Range struct is use to track update to CPU page table after call to
+ * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
+ * to data it gots from this function it calls hmm_vma_range_done() which stop
+ * the tracking.
+ *
+ * There is 2 way to use this :
+ * again:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   trans = device_build_page_table_update_transaction(pfns);
+ *   device_page_table_lock();
+ *   if (!hmm_vma_range_done(vma, range)) {
+ *     device_page_table_unlock();
+ *     goto again;
+ *   }
+ *   device_commit_transaction(trans);
+ *   device_page_table_unlock();
+ *
+ * Or:
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   device_page_table_lock();
+ *   hmm_vma_range_done(vma, range);
+ *   device_update_page_table(pfns);
+ *   device_page_table_unlock();
+ */
+bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
+{
+	unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
+	struct hmm *hmm;
+
+	if (range->end <= range->start) {
+		BUG();
+		return false;
+	}
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		memset(range->pfns, 0, sizeof(*range->pfns) * npages);
+		return false;
+	}
+
+	spin_lock(&hmm->lock);
+	list_del_rcu(&range->list);
+	spin_unlock(&hmm->lock);
+
+	return range->valid;
+}
+EXPORT_SYMBOL(hmm_vma_range_done);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 10/16] mm/hmm/mirror: device page fault handler
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (8 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This handles page faults on behalf of a device driver; unlike
handle_mm_fault() it does not trigger migration back to system memory
for device memory.
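
A condensed sketch of the retry pattern, following the expected-use comment
added by this patch (the my_device_*() lock and commit helpers are made up,
everything else comes from this patch):

    static int my_fault_range(struct mm_struct *mm, struct vm_area_struct *vma,
                              unsigned long start, unsigned long end,
                              hmm_pfn_t *pfns, bool write)
    {
        struct hmm_range range;
        int ret;

    retry:
        down_read(&mm->mmap_sem);
        /* A real driver would re-lookup the vma here after a retry. */
        ret = hmm_vma_fault(vma, &range, start, end, pfns, write, false);
        if (ret == -EAGAIN) {
            /* mmap_sem was dropped for us, stop tracking and try again. */
            hmm_vma_range_done(vma, &range);
            goto retry;
        }
        if (ret) {
            up_read(&mm->mmap_sem);
            return ret;
        }
        my_device_page_table_lock();
        hmm_vma_range_done(vma, &range);
        my_device_commit_pfns(pfns);         /* program device page table */
        my_device_page_table_unlock();
        up_read(&mm->mmap_sem);
        return 0;
    }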

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  26 +++++
 mm/hmm.c            | 269 ++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 267 insertions(+), 28 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b5eafdc..f19c2a0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -288,6 +288,32 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 		     unsigned long end,
 		     hmm_pfn_t *pfns);
 bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+
+
+/*
+ * Fault memory on behalf of a device driver; unlike handle_mm_fault() it will
+ * not migrate any device memory back to system memory. The hmm_pfn_t array
+ * will be updated with the fault result and the current snapshot of the CPU
+ * page table for the range.
+ *
+ * The mmap_sem must be taken in read mode before entering, and it might be
+ * dropped by the function if the block argument is false; when that happens
+ * the function returns -EAGAIN.
+ *
+ * The return value does not reflect whether the fault was successful for every
+ * single address or not; you need to inspect the hmm_pfn_t array to determine
+ * the fault status for each address. Trying to fault inside an invalid vma
+ * will result in -EINVAL.
+ *
+ * See function description in mm/hmm.c for documentation.
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 0ef06df..a397d45 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -288,6 +288,15 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
+
+static void hmm_pfns_error(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	for (; addr < end; addr += PAGE_SIZE, pfns++)
+		*pfns = HMM_PFN_ERROR;
+}
+
 static void hmm_pfns_empty(hmm_pfn_t *pfns,
 			   unsigned long addr,
 			   unsigned long end)
@@ -304,10 +313,43 @@ static void hmm_pfns_special(hmm_pfn_t *pfns,
 		*pfns = HMM_PFN_SPECIAL;
 }
 
-static void hmm_vma_walk(struct vm_area_struct *vma,
-			 unsigned long start,
-			 unsigned long end,
-			 hmm_pfn_t *pfns)
+static void hmm_pfns_clear(hmm_pfn_t *pfns,
+			   unsigned long addr,
+			   unsigned long end)
+{
+	unsigned long npfns = (end - addr) >> PAGE_SHIFT;
+
+	memset(pfns, 0, sizeof(*pfns) * npfns);
+}
+
+static int hmm_vma_do_fault(struct vm_area_struct *vma,
+			    const hmm_pfn_t fault,
+			    unsigned long addr,
+			    hmm_pfn_t *pfn,
+			    bool block)
+{
+	unsigned flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
+	int r;
+
+	flags |= block ? 0 : FAULT_FLAG_ALLOW_RETRY;
+	flags |= (fault & HMM_PFN_WRITE) ? FAULT_FLAG_WRITE : 0;
+	r = handle_mm_fault(vma, addr, flags);
+	if (r & VM_FAULT_RETRY)
+		return -EAGAIN;
+	if (r & VM_FAULT_ERROR) {
+		*pfn = HMM_PFN_ERROR;
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int hmm_vma_walk(struct vm_area_struct *vma,
+			const hmm_pfn_t fault,
+			unsigned long start,
+			unsigned long end,
+			hmm_pfn_t *pfns,
+			bool block)
 {
 	unsigned long addr, next;
 	hmm_pfn_t flag;
@@ -321,6 +363,7 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd_t *pmdp;
 		pte_t *ptep;
 		pmd_t pmd;
+		int ret;
 
 		/*
 		 * We are accessing/faulting for a device from an unknown
@@ -331,15 +374,37 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		next = pgd_addr_end(addr, end);
 		pgdp = pgd_offset(vma->vm_mm, addr);
 		if (pgd_none(*pgdp) || pgd_bad(*pgdp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pudp = pud_alloc(vma->vm_mm, pgdp, addr);
+			if (!pudp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pud_addr_end(addr, end);
 		pudp = pud_offset(pgdp, addr);
 		if (pud_none(*pudp) || pud_bad(*pudp)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			pmdp = pmd_alloc(vma->vm_mm, pudp, addr);
+			if (!pmdp) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
 		}
 
 		next = pmd_addr_end(addr, end);
@@ -347,8 +412,24 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 		pmd = pmd_read_atomic(pmdp);
 		barrier();
 		if (pmd_none(pmd) || pmd_bad(pmd)) {
-			hmm_pfns_empty(&pfns[i], addr, next);
-			continue;
+			if (!(vma->vm_flags & VM_READ)) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			if (!fault) {
+				hmm_pfns_empty(&pfns[i], addr, next);
+				continue;
+			}
+			/*
+			 * Use pte_alloc() instead of pte_alloc_map, because we
+			 * can't run pte_offset_map on the pmd, if an huge pmd
+			 * could materialize from under us.
+			 */
+			if (unlikely(pte_alloc(vma->vm_mm, pmdp, addr))) {
+				hmm_pfns_error(&pfns[i], addr, next);
+				continue;
+			}
+			pmd = *pmdp;
 		}
 		if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
 			unsigned long pfn = pmd_pfn(pmd) + pte_index(addr);
@@ -356,10 +437,14 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 
 			if (pmd_protnone(pmd)) {
 				hmm_pfns_clear(&pfns[i], addr, next);
+				if (fault)
+					goto fault;
 				continue;
 			}
 			flags |= pmd_write(*pmdp) ? HMM_PFN_WRITE : 0;
 			flags |= pmd_devmap(pmd) ? HMM_PFN_DEVICE : 0;
+			if ((flags & fault) != fault)
+				goto fault;
 			for (; addr < next; addr += PAGE_SIZE, i++, pfn++)
 				pfns[i] = hmm_pfn_from_pfn(pfn) | flags;
 			continue;
@@ -370,41 +455,63 @@ static void hmm_vma_walk(struct vm_area_struct *vma,
 			swp_entry_t entry;
 			pte_t pte = *ptep;
 
-			pfns[i] = 0;
-
 			if (pte_none(pte)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
 				pfns[i] = HMM_PFN_EMPTY;
 				continue;
 			}
 
 			entry = pte_to_swp_entry(pte);
 			if (!pte_present(pte) && !non_swap_entry(entry)) {
+				if (fault) {
+					pte_unmap(ptep);
+					goto fault;
+				}
+				pfns[i] = 0;
 				continue;
 			}
 
 			if (pte_present(pte)) {
 				pfns[i] = hmm_pfn_from_pfn(pte_pfn(pte))|flag;
 				pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
-				continue;
-			}
-
-			/*
-			 * This is a special swap entry, ignore migration, use
-			 * device and report anything else as error.
-			*/
-			if (is_device_entry(entry)) {
+			} else if (is_device_entry(entry)) {
+				/* Do not fault device entry */
 				pfns[i] = hmm_pfn_from_pfn(swp_offset(entry));
 				if (is_write_device_entry(entry))
 					pfns[i] |= HMM_PFN_WRITE;
 				pfns[i] |= HMM_PFN_DEVICE;
 				pfns[i] |= HMM_PFN_UNADDRESSABLE;
 				pfns[i] |= flag;
-			} else if (!is_migration_entry(entry)) {
+			} else if (is_migration_entry(entry) && fault) {
+				migration_entry_wait(vma->vm_mm, pmdp, addr);
+				/* Start again for current address */
+				next = addr;
+				ptep++;
+				break;
+			} else {
+				/* Report error for everything else */
 				pfns[i] = HMM_PFN_ERROR;
 			}
+			if ((fault & pfns[i]) != fault) {
+				pte_unmap(ptep);
+				goto fault;
+			}
 		}
 		pte_unmap(ptep - 1);
+		continue;
+
+fault:
+		ret = hmm_vma_do_fault(vma, fault, addr, &pfns[i], block);
+		if (ret)
+			return ret;
+		/* Start again for current address */
+		next = addr;
 	}
+
+	return 0;
 }
 
 /*
@@ -463,7 +570,7 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
 	list_add_rcu(&range->list, &hmm->ranges);
 	spin_unlock(&hmm->lock);
 
-	hmm_vma_walk(vma, start, end, pfns);
+	hmm_vma_walk(vma, 0, start, end, pfns, false);
 	return 0;
 }
 EXPORT_SYMBOL(hmm_vma_get_pfns);
@@ -474,14 +581,22 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  * @range: range being track
  * Returns: false if range data have been invalidated, true otherwise
  *
- * Range struct is use to track update to CPU page table after call to
- * hmm_vma_get_pfns(). Once device driver is done using or want to lock update
- * to data it gots from this function it calls hmm_vma_range_done() which stop
- * the tracking.
+ * The range struct is used to track updates to the CPU page table after a call
+ * to either hmm_vma_get_pfns() or hmm_vma_fault(). Once the device driver is
+ * done using, or wants to lock updates to, the data it got from those functions
+ * it must call hmm_vma_range_done(), which stops tracking CPU page table updates.
+ *
+ * Note that the device driver must still implement general CPU page table update
+ * tracking either by using hmm_mirror (see hmm_mirror_register()) or by using
+ * the mmu_notifier API directly.
+ *
+ * CPU page table update tracking done through hmm_range is only temporary and
+ * to be used while trying to duplicate CPU page table content for a range of
+ * virtual addresses.
  *
  * There is 2 way to use this :
  * again:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   trans = device_build_page_table_update_transaction(pfns);
  *   device_page_table_lock();
  *   if (!hmm_vma_range_done(vma, range)) {
@@ -492,7 +607,7 @@ EXPORT_SYMBOL(hmm_vma_get_pfns);
  *   device_page_table_unlock();
  *
  * Or:
- *   hmm_vma_get_pfns(vma, range, start, end, pfns);
+ *   hmm_vma_get_pfns(vma, range, start, end, pfns); or hmm_vma_fault(...);
  *   device_page_table_lock();
  *   hmm_vma_range_done(vma, range);
  *   device_update_page_table(pfns);
@@ -521,4 +636,102 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range)
 	return range->valid;
 }
 EXPORT_SYMBOL(hmm_vma_range_done);
+
+/*
+ * hmm_vma_fault() - try to fault some address in a virtual address range
+ * @vma: virtual memory area containing the virtual address range
+ * @range: used to track pfns array content validity
+ * @start: fault range virtual start address (inclusive)
+ * @end: fault range virtual end address (exclusive)
+ * @pfns: array of hmm_pfn_t, only entries with the fault flag set will be faulted
+ * @write: is it a write fault
+ * @block: allow blocking on fault (if true it sleeps and does not drop mmap_sem)
+ * Returns: 0 on success, error otherwise (-EAGAIN means mmap_sem has been dropped)
+ *
+ * This is similar to a regular CPU page fault except that it will not trigger
+ * any memory migration if the memory being faulted is not accessible by CPUs.
+ *
+ * On error, for one virtual address in the range, the function will set the
+ * hmm_pfn_t error flag for the corresponding pfn entry.
+ *
+ * Expected use pattern:
+ * retry:
+ *   down_read(&mm->mmap_sem);
+ *   // Find vma and address device wants to fault, initialize hmm_pfn_t
+ *   // array accordingly
+ *   ret = hmm_vma_fault(vma, start, end, pfns, allow_retry);
+ *   switch (ret) {
+ *   case -EAGAIN:
+ *     hmm_vma_range_done(vma, range);
+ *     // You might want to rate limit or yield to play nicely, you may
+ *     // also commit any valid pfn in the array assuming that you are
+ *     // getting true from hmm_vma_range_done()
+ *     goto retry;
+ *   case 0:
+ *     break;
+ *   default:
+ *     // Handle error !
+ *     up_read(&mm->mmap_sem)
+ *     return;
+ *   }
+ *   // Take device driver lock that serialize device page table update
+ *   driver_lock_device_page_table_update();
+ *   hmm_vma_range_done(vma, range);
+ *   // Commit pfns we got from hmm_vma_fault()
+ *   driver_unlock_device_page_table_update();
+ *   up_read(&mm->mmap_sem)
+ *
+ * YOU MUST CALL hmm_vma_range_done() AFTER THIS FUNCTION RETURNS SUCCESS (0)
+ * BEFORE FREEING THE range struct OR YOU WILL HAVE SERIOUS MEMORY CORRUPTION !
+ *
+ * YOU HAVE BEEN WARNED !
+ */
+int hmm_vma_fault(struct vm_area_struct *vma,
+		  struct hmm_range *range,
+		  unsigned long start,
+		  unsigned long end,
+		  hmm_pfn_t *pfns,
+		  bool write,
+		  bool block)
+{
+	hmm_pfn_t fault = HMM_PFN_READ | (write ? HMM_PFN_WRITE : 0);
+	struct hmm *hmm;
+	int ret;
+
+	/* Sanity check, this really should not happen ! */
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end < vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	hmm = hmm_register(vma->vm_mm);
+	if (!hmm) {
+		hmm_pfns_clear(pfns, start, end);
+		return -ENOMEM;
+	}
+	/* Caller must have register a mirror (with hmm_mirror_register()) ! */
+	if (!hmm->mmu_notifier.ops)
+		return -EINVAL;
+
+	/* Initialize range to track CPU page table update */
+	range->start = start;
+	range->pfns = pfns;
+	range->end = end;
+	spin_lock(&hmm->lock);
+	range->valid = true;
+	list_add_rcu(&range->list, &hmm->ranges);
+	spin_unlock(&hmm->lock);
+
+	/* FIXME support hugetlb fs */
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+		hmm_pfns_special(pfns, start, end);
+		return 0;
+	}
+
+	ret = hmm_vma_walk(vma, fault, start, end, pfns, block);
+	if (ret)
+		hmm_vma_range_done(vma, range);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (9 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow unmapping and restoring the special swap entries used for
un-addressable ZONE_DEVICE memory.
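
For reference, a minimal sketch (not part of the patch) of how a page table
walker recognizes such an entry, using the device entry helpers this series
relies on (pte_to_swp_entry(), is_device_entry(), device_entry_to_page()):

	/*
	 * Sketch only: un-addressable device pages live in the CPU page table
	 * as special swap entries, so a pte walker has to go through
	 * pte_to_swp_entry() to get back to the struct page.
	 */
	static struct page *sketch_pte_to_device_page(pte_t pte)
	{
		swp_entry_t entry;

		if (pte_present(pte) || pte_none(pte))
			return NULL;		/* normal or empty pte */

		entry = pte_to_swp_entry(pte);
		if (!is_device_entry(entry))
			return NULL;		/* regular swap entry */

		/* un-addressable ZONE_DEVICE page backing this address */
		return device_entry_to_page(entry);
	}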

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/migrate.c | 11 ++++++++++-
 mm/rmap.c    | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 66ce6b4..6b6b457 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -40,6 +40,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -248,7 +249,15 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		pte = arch_make_huge_pte(pte, vma, new, 0);
 	}
 #endif
-	flush_dcache_page(new);
+
+	if (unlikely(is_zone_device_page(new)) && !is_addressable_page(new)) {
+		entry = make_device_entry(new, pte_write(pte));
+		pte = swp_entry_to_pte(entry);
+		if (pte_swp_soft_dirty(*ptep))
+			pte = pte_mksoft_dirty(pte);
+	} else
+		flush_dcache_page(new);
+
 	set_pte_at(mm, addr, ptep, pte);
 
 	if (PageHuge(new)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 1ef3640..719c334 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -61,6 +61,7 @@
 #include <linux/hugetlb.h>
 #include <linux/backing-dev.h>
 #include <linux/page_idle.h>
+#include <linux/memremap.h>
 
 #include <asm/tlbflush.h>
 
@@ -1455,6 +1456,52 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			goto out;
 	}
 
+	if ((flags & TTU_MIGRATION) && is_zone_device_page(page)) {
+		swp_entry_t entry;
+		pte_t swp_pte;
+		pmd_t *pmdp;
+
+		if (!dev_page_allow_migrate(page))
+			goto out;
+
+		pmdp = mm_find_pmd(mm, address);
+		if (!pmdp)
+			goto out;
+
+		pte = pte_offset_map_lock(mm, pmdp, address, &ptl);
+		if (!pte)
+			goto out;
+
+		pteval = ptep_get_and_clear(mm, address, pte);
+		if (pte_present(pteval) || pte_none(pteval)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		entry = pte_to_swp_entry(pteval);
+		if (!is_device_entry(entry)) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		if (device_entry_to_page(entry) != page) {
+			set_pte_at(mm, address, pte, pteval);
+			goto out_unmap;
+		}
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		entry = make_migration_entry(page, 0);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(*pte))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(mm, address, pte, swp_pte);
+		goto discard;
+	}
+
 	pte = page_check_address(page, mm, address, &ptl,
 				 PageTransCompound(page));
 	if (!pte)
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (10 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm; +Cc: John Hubbard, Jérôme Glisse

Allow migration without a copy in the case where the destination page
already has the source page's content. This is useful for HMM migration
to device memory, where we copy the page before doing the final
migration step.

This feature needs a careful audit of filesystem code to make sure that
no one can write to the source page while it is unmapped and locked. It
should be safe for most filesystems, but as a precaution return an
error until support for device migration is added to them.
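
As a minimal sketch (hypothetical filesystem "foofs", not part of this patch),
a ->migratepage() implementation following the convention used below would
look like:

	static int foofs_migratepage(struct address_space *mapping,
				     struct page *newpage, struct page *page,
				     enum migrate_mode mode, bool copy)
	{
		/*
		 * Until the filesystem is audited for device migration,
		 * refuse un-addressable destination memory.
		 */
		if (!is_addressable_page(newpage))
			return -EINVAL;

		/*
		 * migrate_page() always transfers page state; it copies the
		 * page content only when copy is true. When copy is false the
		 * caller (HMM) has already copied the data.
		 */
		return migrate_page(mapping, newpage, page, mode, copy);
	}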

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/staging/lustre/lustre/llite/rw26.c |  8 +++--
 fs/aio.c                                   |  7 +++-
 fs/btrfs/disk-io.c                         | 11 ++++--
 fs/hugetlbfs/inode.c                       |  9 +++--
 fs/nfs/internal.h                          |  5 +--
 fs/nfs/write.c                             |  9 +++--
 fs/ubifs/file.c                            |  8 ++++-
 include/linux/balloon_compaction.h         |  3 +-
 include/linux/fs.h                         | 13 ++++---
 include/linux/migrate.h                    |  7 ++--
 mm/balloon_compaction.c                    |  2 +-
 mm/migrate.c                               | 56 +++++++++++++++++++-----------
 mm/zsmalloc.c                              | 12 ++++++-
 13 files changed, 106 insertions(+), 44 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index d98c7ac..e163d43 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -43,6 +43,7 @@
 #include <linux/uaccess.h>
 
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/fs.h>
 #include <linux/buffer_head.h>
 #include <linux/mpage.h>
@@ -643,9 +644,12 @@ static int ll_write_end(struct file *file, struct address_space *mapping,
 #ifdef CONFIG_MIGRATION
 static int ll_migratepage(struct address_space *mapping,
 			  struct page *newpage, struct page *page,
-			  enum migrate_mode mode
-		)
+			  enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/* Always fail page migration until we have a proper implementation */
 	return -EIO;
 }
diff --git a/fs/aio.c b/fs/aio.c
index 4fe81d1..416c7ef 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
 #include <linux/blkdev.h>
 #include <linux/compat.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
@@ -363,13 +364,17 @@ static const struct file_operations aio_ring_fops = {
 
 #if IS_ENABLED(CONFIG_MIGRATION)
 static int aio_migratepage(struct address_space *mapping, struct page *new,
-			struct page *old, enum migrate_mode mode)
+			   struct page *old, enum migrate_mode mode, bool copy)
 {
 	struct kioctx *ctx;
 	unsigned long flags;
 	pgoff_t idx;
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(new))
+		return -EINVAL;
+
 	rc = 0;
 
 	/* mapping->private_lock here protects against the kioctx teardown.  */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54bc8c7..9a29aa5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/ratelimit.h>
 #include <linux/uuid.h>
 #include <linux/semaphore.h>
@@ -1023,9 +1024,13 @@ out_w_error:
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page,
-			enum migrate_mode mode)
+			     struct page *newpage, struct page *page,
+			     enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * we can't safely write a btree page from here,
 	 * we haven't done the locking hook
@@ -1039,7 +1044,7 @@ static int btree_migratepage(struct address_space *mapping,
 	if (page_has_private(page) &&
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7337cac..de77e6f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -35,6 +35,7 @@
 #include <linux/security.h>
 #include <linux/magic.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/uio.h>
 
 #include <asm/uaccess.h>
@@ -842,11 +843,15 @@ static int hugetlbfs_set_page_dirty(struct page *page)
 }
 
 static int hugetlbfs_migrate_page(struct address_space *mapping,
-				struct page *newpage, struct page *page,
-				enum migrate_mode mode)
+				  struct page *newpage, struct page *page,
+				  enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index da9e558..db1c2ad 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -537,8 +537,9 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct pnfs_ds_commit_info *cinfo)
 
 
 #ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *, enum migrate_mode);
+extern int nfs_migrate_page(struct address_space *mapping,
+			    struct page *newpage, struct page *page,
+			    enum migrate_mode, bool copy);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5321183..d7130a5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -14,6 +14,7 @@
 #include <linux/writeback.h>
 #include <linux/swap.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 #include <linux/sunrpc/clnt.h>
 #include <linux/nfs_fs.h>
@@ -2023,8 +2024,12 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		     struct page *page, enum migrate_mode mode, bool copy)
 {
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
 	 * an in-progress read or write request. Don't try to migrate it.
@@ -2039,7 +2044,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 	if (!nfs_fscache_release_page(page, GFP_KERNEL))
 		return -EBUSY;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, copy);
 }
 #endif
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 7bbf420..57bff28 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -53,6 +53,7 @@
 #include <linux/mount.h>
 #include <linux/slab.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 
 static int read_block(struct inode *inode, void *addr, unsigned int block,
 		      struct ubifs_data_node *dn)
@@ -1455,10 +1456,15 @@ static int ubifs_set_page_dirty(struct page *page)
 
 #ifdef CONFIG_MIGRATION
 static int ubifs_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			      struct page *newpage, struct page *page,
+			      enum migrate_mode mode, bool copy)
 {
 	int rc;
 
+	/* Can only migrate addressable memory for now */
+	if (!is_addressable_page(newpage))
+		return -EINVAL;
+
 	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 79542b2..27cf3e3 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -85,7 +85,8 @@ extern bool balloon_page_isolate(struct page *page,
 extern void balloon_page_putback(struct page *page);
 extern int balloon_page_migrate(struct address_space *mapping,
 				struct page *newpage,
-				struct page *page, enum migrate_mode mode);
+				struct page *page, enum migrate_mode mode,
+				bool copy);
 
 /*
  * balloon_page_insert - insert a page into the balloon's page list and make
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02bc78e..a54d164 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -396,8 +396,9 @@ struct address_space_operations {
 	 * migrate the contents of a page to the specified target. If
 	 * migrate_mode is MIGRATE_ASYNC, it must not block.
 	 */
-	int (*migratepage) (struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+	int (*migratepage)(struct address_space *mapping,
+			   struct page *newpage, struct page *page,
+			   enum migrate_mode, bool copy);
 	bool (*isolate_page)(struct page *, isolate_mode_t);
 	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
@@ -2989,9 +2990,11 @@ extern int generic_file_fsync(struct file *, loff_t, loff_t, int);
 extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *,
-				enum migrate_mode);
+extern int buffer_migrate_page(struct address_space *mapping,
+			       struct page *newpage,
+			       struct page *page,
+			       enum migrate_mode,
+			       bool copy);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ae8d475..37b77ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -33,8 +33,11 @@ extern char *migrate_reason_names[MR_TYPES];
 #ifdef CONFIG_MIGRATION
 
 extern void putback_movable_pages(struct list_head *l);
-extern int migrate_page(struct address_space *,
-			struct page *, struct page *, enum migrate_mode);
+extern int migrate_page(struct address_space *mapping,
+			struct page *newpage,
+			struct page *page,
+			enum migrate_mode,
+			bool copy);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
 extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index da91df5..ed5cacb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -135,7 +135,7 @@ void balloon_page_putback(struct page *page)
 /* move_to_new_page() counterpart for a ballooned page */
 int balloon_page_migrate(struct address_space *mapping,
 		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		enum migrate_mode mode, bool copy)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b6b457..d9ce8db 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,18 +622,10 @@ static void copy_huge_page(struct page *dst, struct page *src)
 	}
 }
 
-/*
- * Copy the page to its new location
- */
-void migrate_page_copy(struct page *newpage, struct page *page)
+static void migrate_page_states(struct page *newpage, struct page *page)
 {
 	int cpupid;
 
-	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
-	else
-		copy_highpage(newpage, page);
-
 	if (PageError(page))
 		SetPageError(newpage);
 	if (PageReferenced(page))
@@ -687,6 +679,19 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 
 	mem_cgroup_migrate(page, newpage);
 }
+
+/*
+ * Copy the page to its new location
+ */
+void migrate_page_copy(struct page *newpage, struct page *page)
+{
+	if (PageHuge(page) || PageTransHuge(page))
+		copy_huge_page(newpage, page);
+	else
+		copy_highpage(newpage, page);
+
+	migrate_page_states(newpage, page);
+}
 EXPORT_SYMBOL(migrate_page_copy);
 
 /************************************************************
@@ -700,8 +705,8 @@ EXPORT_SYMBOL(migrate_page_copy);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page,
-		enum migrate_mode mode)
+		 struct page *newpage, struct page *page,
+		 enum migrate_mode mode, bool copy)
 {
 	int rc;
 
@@ -712,7 +717,11 @@ int migrate_page(struct address_space *mapping,
 	if (rc != MIGRATEPAGE_SUCCESS)
 		return rc;
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
+
 	return MIGRATEPAGE_SUCCESS;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -724,13 +733,14 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page, enum migrate_mode mode)
+			struct page *newpage, struct page *page,
+			enum migrate_mode mode, bool copy)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page, mode);
+		return migrate_page(mapping, newpage, page, mode, copy);
 
 	head = page_buffers(page);
 
@@ -762,12 +772,15 @@ int buffer_migrate_page(struct address_space *mapping,
 
 	SetPagePrivate(newpage);
 
-	migrate_page_copy(newpage, page);
+	if (copy)
+		migrate_page_copy(newpage, page);
+	else
+		migrate_page_states(newpage, page);
 
 	bh = head;
 	do {
 		unlock_buffer(bh);
- 		put_bh(bh);
+		put_bh(bh);
 		bh = bh->b_this_page;
 
 	} while (bh != head);
@@ -822,7 +835,8 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page, enum migrate_mode mode)
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
 {
 	if (PageDirty(page)) {
 		/* Only writeback pages in full synchronous migration */
@@ -839,7 +853,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page, mode);
+	return migrate_page(mapping, newpage, page, mode, true);
 }
 
 /*
@@ -867,7 +881,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	if (likely(is_lru)) {
 		if (!mapping)
-			rc = migrate_page(mapping, newpage, page, mode);
+			rc = migrate_page(mapping, newpage, page, mode, true);
 		else if (mapping->a_ops->migratepage)
 			/*
 			 * Most pages have a mapping and most filesystems
@@ -877,7 +891,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 			 * for page migration.
 			 */
 			rc = mapping->a_ops->migratepage(mapping, newpage,
-							page, mode);
+							page, mode, true);
 		else
 			rc = fallback_migrate_page(mapping, newpage,
 							page, mode);
@@ -894,7 +908,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		}
 
 		rc = mapping->a_ops->migratepage(mapping, newpage,
-						page, mode);
+						page, mode, true);
 		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
 			!PageIsolated(page));
 	}
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7b5fd2b..7bf9bea 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -54,6 +54,7 @@
 #include <linux/zpool.h>
 #include <linux/mount.h>
 #include <linux/migrate.h>
+#include <linux/memremap.h>
 #include <linux/pagemap.h>
 #include <trace/events/zsmalloc.h>
 
@@ -2022,7 +2023,7 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 }
 
 int zs_page_migrate(struct address_space *mapping, struct page *newpage,
-		struct page *page, enum migrate_mode mode)
+		    struct page *page, enum migrate_mode mode, bool copy)
 {
 	struct zs_pool *pool;
 	struct size_class *class;
@@ -2040,6 +2041,15 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
+	/*
+	 * Offloading the copy operation for zspage requires special
+	 * considerations due to locking, so for now we only support regular
+	 * migration. I do not expect we will ever want to support offloading
+	 * the copy. See hmm.h for more information on hmm_vma_migrate() and
+	 * offloaded copy.
+	 */
+	if (!copy || !is_addressable_page(newpage))
+		return -EINVAL;
+
 	zspage = get_zspage(page);
 
 	/* Concurrent compactor cannot migrate any subpage in zspage */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (11 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This patch adds a new memory migration helper, which migrates memory
backing a range of virtual addresses of a process to different memory
(which can be allocated through a special allocator). It differs from
NUMA migration by working on a range of virtual addresses and thus by
doing migration in chunks that can be large enough to use a DMA engine
or a special copy offloading engine.

Expected users are anyone with heterogeneous memory where different
memory has different characteristics (latency, bandwidth, ...). As an
example, IBM platforms with a CAPI bus can make use of this feature to
migrate between regular memory and CAPI device memory. New CPU
architectures with a pool of high performance memory not managed as a
cache but presented as regular memory (while being faster and with
lower latency than DDR) will also be prime users of this patch.

Migration to private device memory will be useful for devices that have
a large pool of such memory, like GPUs; NVidia plans to use HMM for
that.
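
As a minimal sketch (hypothetical driver helpers foo_copy_to_device() and
foo_update_device_pt(), not part of this patch), a device driver would wire
the helper up roughly like this:

	static void foo_alloc_and_copy(struct vm_area_struct *vma,
				       const hmm_pfn_t *src_pfns,
				       hmm_pfn_t *dst_pfns,
				       unsigned long start,
				       unsigned long end,
				       void *private)
	{
		/* allocate device pages and DMA the source data over */
		foo_copy_to_device(private, src_pfns, dst_pfns, start, end);
	}

	static void foo_finalize_and_map(struct vm_area_struct *vma,
					 const hmm_pfn_t *src_pfns,
					 hmm_pfn_t *dst_pfns,
					 unsigned long start,
					 unsigned long end,
					 void *private)
	{
		/* only entries with HMM_PFN_MIGRATE set actually migrated */
		foo_update_device_pt(private, src_pfns, dst_pfns, start, end);
	}

	static const struct hmm_migrate_ops foo_migrate_ops = {
		.alloc_and_copy		= foo_alloc_and_copy,
		.finalize_and_map	= foo_finalize_and_map,
	};

	/* with mmap_sem held for read and the vma already looked up: */
	ret = hmm_vma_migrate(&foo_migrate_ops, vma, src_pfns, dst_pfns,
			      start, end, driver_private);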

Changed since v1:
  - typos fix
  - split early unmap optimization for page with single mapping

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h |  66 +++++++-
 mm/Kconfig          |  13 ++
 mm/migrate.c        | 460 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 536 insertions(+), 3 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index f19c2a0..b1de4e1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -88,10 +88,13 @@ struct hmm;
  * HMM_PFN_ERROR: corresponding CPU page table entry point to poisonous memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is none (pte_none() true)
  * HMM_PFN_DEVICE: this is device memory (ie a ZONE_DEVICE page)
+ * HMM_PFN_LOCKED: underlying struct page is locked
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special ie result of
  *      vm_insert_pfn() or vm_insert_page() and thus should not be mirror by a
  *      device (the entry will never have HMM_PFN_VALID set and the pfn value
  *      is undefine)
+ * HMM_PFN_MIGRATE: used by hmm_vma_migrate() to signify which addresses can be
+ *      migrated
  * HMM_PFN_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
 typedef unsigned long hmm_pfn_t;
@@ -102,9 +105,11 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_ERROR (1 << 3)
 #define HMM_PFN_EMPTY (1 << 4)
 #define HMM_PFN_DEVICE (1 << 5)
-#define HMM_PFN_SPECIAL (1 << 6)
-#define HMM_PFN_UNADDRESSABLE (1 << 7)
-#define HMM_PFN_SHIFT 8
+#define HMM_PFN_LOCKED (1 << 6)
+#define HMM_PFN_SPECIAL (1 << 7)
+#define HMM_PFN_MIGRATE (1 << 8)
+#define HMM_PFN_UNADDRESSABLE (1 << 9)
+#define HMM_PFN_SHIFT 10
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid hmm_pfn_t
@@ -317,6 +322,61 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
+#if IS_ENABLED(CONFIG_HMM_MIGRATE)
+/*
+ * struct hmm_migrate_ops - migrate operation callback
+ *
+ * @alloc_and_copy: allocate destination memory and copy source to it
+ * @finalize_and_map: allow caller to inspect successfully migrated pages
+ *
+ * The new HMM migrate helper hmm_vma_migrate() allows memory migration to use
+ * a device DMA engine to perform the copy from source to destination memory.
+ * It also allows the caller to use its own memory allocator for destination
+ * memory.
+ *
+ * Note that in alloc_and_copy the device driver can decide not to migrate some
+ * of the entries by simply setting the corresponding dst_pfns entry to 0.
+ *
+ * The destination page must be locked and the HMM_PFN_LOCKED flag set in the
+ * corresponding hmm_pfn_t entry of the dst_pfns array. It is expected that the
+ * allocated page will have an elevated refcount and that a put_page() will
+ * free the page.
+ *
+ * The device driver might want to allocate with an extra refcount if it wants
+ * to control deallocation of failed migrations inside the finalize_and_map()
+ * callback.
+ *
+ * Inside finalize_and_map() the device driver must use the HMM_PFN_MIGRATE
+ * flag to determine which pages have been successfully migrated (this is set
+ * inside the src_pfns array).
+ *
+ * For migration from device memory to system memory the device driver must set
+ * the dst_pfns entry to HMM_PFN_ERROR for any entry it can not migrate back
+ * due to a fatal hardware failure that can not be recovered. Such a failure
+ * will trigger a SIGBUS for the process trying to access such memory.
+ */
+struct hmm_migrate_ops {
+	void (*alloc_and_copy)(struct vm_area_struct *vma,
+			       const hmm_pfn_t *src_pfns,
+			       hmm_pfn_t *dst_pfns,
+			       unsigned long start,
+			       unsigned long end,
+			       void *private);
+	void (*finalize_and_map)(struct vm_area_struct *vma,
+				 const hmm_pfn_t *src_pfns,
+				 hmm_pfn_t *dst_pfns,
+				 unsigned long start,
+				 unsigned long end,
+				 void *private);
+};
+
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    hmm_pfn_t *src_pfns,
+		    hmm_pfn_t *dst_pfns,
+		    unsigned long start,
+		    unsigned long end,
+		    void *private);
+#endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 7dd4ca3..dd091da 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -308,6 +308,19 @@ config HMM_MIRROR
 	  range of virtual address. This require careful synchronization with
 	  CPU page table update.
 
+config HMM_MIGRATE
+	bool "HMM migrate virtual range of process using device driver DMA"
+	select HMM
+	select MIGRATION
+	help
+	  HMM migrate is a new helper to migrate a range of virtual addresses
+	  using a special page allocator and copy callback. This allows a
+	  device driver to migrate a range of a process's memory to its own
+	  memory using its DMA engine.
+
+	  It obeys all rules of memory migration, except that it supports
+	  migration of ZONE_DEVICE pages that have the
+	  MEMORY_DEVICE_ALLOW_MIGRATE flag set.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/migrate.c b/mm/migrate.c
index d9ce8db..5ebd3c5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -41,6 +41,7 @@
 #include <linux/page_idle.h>
 #include <linux/page_owner.h>
 #include <linux/memremap.h>
+#include <linux/hmm.h>
 
 #include <asm/tlbflush.h>
 
@@ -421,6 +422,14 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int expected_count = 1 + extra_count;
 	void **pslot;
 
+	/*
+	 * ZONE_DEVICE pages have 1 refcount always held by their device
+	 *
+	 * Note that DAX memory will never reach that point as it does not have
+	 * the MEMORY_DEVICE_ALLOW_MIGRATE flag set (see memory_hotplug.h).
+	 */
+	expected_count += is_zone_device_page(page);
+
 	if (!mapping) {
 		/* Anonymous page without mapping */
 		if (page_count(page) != expected_count)
@@ -2087,3 +2096,454 @@ out_unlock:
 #endif /* CONFIG_NUMA_BALANCING */
 
 #endif /* CONFIG_NUMA */
+
+
+#if IS_ENABLED(CONFIG_HMM_MIGRATE)
+struct hmm_migrate {
+	struct vm_area_struct	*vma;
+	hmm_pfn_t		*dst_pfns;
+	hmm_pfn_t		*src_pfns;
+	unsigned long		npages;
+	unsigned long		start;
+	unsigned long		end;
+};
+
+static int hmm_collect_walk_pmd(pmd_t *pmdp,
+				unsigned long start,
+				unsigned long end,
+				struct mm_walk *walk)
+{
+	struct hmm_migrate *migrate = walk->private;
+	struct mm_struct *mm = walk->vma->vm_mm;
+	unsigned long addr = start;
+	hmm_pfn_t *src_pfns;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+again:
+	if (pmd_none(*pmdp))
+		return 0;
+
+	split_huge_pmd(walk->vma, pmdp, addr);
+	if (pmd_trans_unstable(pmdp))
+		goto again;
+
+	src_pfns = &migrate->src_pfns[(addr - migrate->start) >> PAGE_SHIFT];
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+
+	for (; addr < end; addr += PAGE_SIZE, src_pfns++, ptep++) {
+		unsigned long pfn;
+		swp_entry_t entry;
+		struct page *page;
+		hmm_pfn_t flags;
+		bool write;
+		pte_t pte;
+
+		pte = *ptep;
+
+		if (!pte_present(pte)) {
+			if (pte_none(pte))
+				continue;
+
+			/*
+			 * Only care about un-addressable device page special
+			 * page table entries. Other special swap entries are
+			 * not migratable and we ignore regular swapped pages.
+			 */
+			entry = pte_to_swp_entry(pte);
+			if (!is_device_entry(entry))
+				continue;
+
+			flags = HMM_PFN_DEVICE | HMM_PFN_UNADDRESSABLE;
+			write = is_write_device_entry(entry);
+			page = device_entry_to_page(entry);
+			pfn = page_to_pfn(page);
+
+			if (!dev_page_allow_migrate(page))
+				continue;
+		} else {
+			pfn = pte_pfn(pte);
+			write = pte_write(pte);
+			page = pfn_to_page(pfn);
+			flags = is_zone_device_page(page) ? HMM_PFN_DEVICE : 0;
+		}
+
+		/* FIXME support THP see hmm_migrate_page_check() */
+		if (PageTransCompound(page))
+			continue;
+
+		/*
+		 * Corner case handling:
+		 * 1. When a new swap-cache page is read in, it is added to
+		 * the LRU and treated as swapcache but it has no rmap yet. Skip
+		 * those.
+		 */
+		if (!page->mapping)
+			continue;
+
+		*src_pfns = hmm_pfn_from_pfn(pfn) | HMM_PFN_MIGRATE | flags;
+		*src_pfns |= write ? HMM_PFN_WRITE : 0;
+		migrate->npages++;
+
+		/*
+		 * By getting a reference on the page we pin it and block any
+		 * kind of migration. A side effect is that it "freezes" the pte.
+		 *
+		 * We drop this reference after isolating the page from the lru
+		 * for non-device pages (device pages are not on the lru and thus
+		 * can't be dropped from it).
+		 */
+		get_page(page);
+	}
+	pte_unmap_unlock(ptep - 1, ptl);
+
+	return 0;
+}
+
+/*
+ * hmm_migrate_collect() - collect pages over a range of virtual addresses
+ * @migrate: migrate struct containing all migration information
+ *
+ * This will go over the CPU page table and, for each virtual address backed by
+ * a valid page, update the src_pfns array and take a reference on the page in
+ * order to pin the page until we lock it and unmap it.
+ */
+static void hmm_migrate_collect(struct hmm_migrate *migrate)
+{
+	struct mm_walk mm_walk;
+
+	mm_walk.pmd_entry = hmm_collect_walk_pmd;
+	mm_walk.pte_entry = NULL;
+	mm_walk.pte_hole = NULL;
+	mm_walk.hugetlb_entry = NULL;
+	mm_walk.test_walk = NULL;
+	mm_walk.vma = migrate->vma;
+	mm_walk.mm = migrate->vma->vm_mm;
+	mm_walk.private = migrate;
+
+	mmu_notifier_invalidate_range_start(mm_walk.mm,
+					    migrate->start,
+					    migrate->end);
+	walk_page_range(migrate->start, migrate->end, &mm_walk);
+	mmu_notifier_invalidate_range_end(mm_walk.mm,
+					  migrate->start,
+					  migrate->end);
+}
+
+/*
+ * hmm_migrate_page_check() - check if a page is pinned or not
+ * @page: struct page to check
+ *
+ * Pinned pages can not be migrated. This is the same test as in
+ * migrate_page_move_mapping(), except that here we allow migration of
+ * ZONE_DEVICE pages.
+ */
+static inline bool hmm_migrate_page_check(struct page *page)
+{
+	/*
+	 * One extra ref because the caller holds an extra reference, either
+	 * from isolate_lru_page() for a regular page or hmm_migrate_collect()
+	 * for a device page.
+	 */
+	int extra = 1;
+
+	/*
+	 * FIXME support THP (transparent huge page), it is a bit more complex
+	 * to check them than regular pages because they can be mapped with a
+	 * pmd or with a pte (split pte mapping).
+	 */
+	if (PageCompound(page))
+		return false;
+
+	/* Page from ZONE_DEVICE have one extra reference */
+	if (is_zone_device_page(page)) {
+		if (!dev_page_allow_migrate(page))
+			return false;
+		extra++;
+	}
+
+	if ((page_count(page) - extra) > page_mapcount(page))
+		return false;
+
+	return true;
+}
+
+/*
+ * hmm_migrate_lock_and_isolate() - lock pages and isolate them from the lru
+ * @migrate: migrate struct containing all migration information
+ *
+ * This locks pages that have been collected by hmm_migrate_collect(). Once a
+ * page is locked it is isolated from the lru (for non-device pages). Finally
+ * the ref taken by hmm_migrate_collect() is dropped, as a locked page can not
+ * be migrated by a concurrent kernel thread.
+ */
+static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+	bool allow_drain = true;
+
+	lru_add_drain();
+
+	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page)
+			continue;
+
+		lock_page(page);
+		migrate->src_pfns[i] |= HMM_PFN_LOCKED;
+
+		/* ZONE_DEVICE page are not on LRU */
+		if (!is_zone_device_page(page)) {
+			if (!PageLRU(page) && allow_drain) {
+				/* Drain CPU's pagevec */
+				lru_add_drain_all();
+				allow_drain = false;
+			}
+
+			if (isolate_lru_page(page)) {
+				migrate->src_pfns[i] = 0;
+				migrate->npages--;
+				unlock_page(page);
+				put_page(page);
+			} else
+				/* Drop the reference we took in collect */
+				put_page(page);
+		}
+
+		if (!hmm_migrate_page_check(page)) {
+			migrate->src_pfns[i] = 0;
+			migrate->npages--;
+			unlock_page(page);
+			put_page(page);
+		}
+	}
+}
+
+/*
+ * hmm_migrate_unmap() - replace page mapping with special migration pte entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * Replace the page mapping (CPU page table pte) with a special migration pte
+ * entry and check again if the page has been pinned. Pinned pages are restored
+ * because we can not migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy the content of the original page over to the
+ * new page.
+ */
+static void hmm_migrate_unmap(struct hmm_migrate *migrate)
+{
+	int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	unsigned long addr = migrate->start, i = 0, restore = 0;
+
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page || !(migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		try_to_unmap(page, flags);
+		if (page_mapped(page) || !hmm_migrate_page_check(page)) {
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+			migrate->npages--;
+			restore++;
+		}
+	}
+
+	for (; (addr < migrate->end) && restore; addr += PAGE_SIZE, i++) {
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page || (migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		remove_migration_ptes(page, page, false);
+
+		migrate->src_pfns[i] = 0;
+		unlock_page(page);
+		restore--;
+
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
+	}
+}
+
+/*
+ * hmm_migrate_struct_page() - migrate meta-data from src page to dst page
+ * @migrate: migrate struct containing all migration information
+ *
+ * This migrates struct page meta-data from the source struct page to the
+ * destination struct page. This effectively finishes the migration from the
+ * source page to the destination page.
+ */
+static void hmm_migrate_struct_page(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+
+	for (; addr < migrate->end; addr += PAGE_SIZE, i++) {
+		struct page *newpage = hmm_pfn_to_page(migrate->dst_pfns[i]);
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+		struct address_space *mapping;
+		int r;
+
+		if (!page || !newpage)
+			continue;
+		if (!(migrate->src_pfns[i] & HMM_PFN_MIGRATE))
+			continue;
+
+		mapping = page_mapping(page);
+
+		/*
+		 * For now only support private anonymous when migrating
+		 * to un-addressable device memory.
+		 */
+		if (mapping && is_zone_device_page(newpage) &&
+		    !is_addressable_page(newpage)) {
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+			continue;
+		}
+
+		r = migrate_page(mapping, newpage, page, MIGRATE_SYNC, false);
+		if (r != MIGRATEPAGE_SUCCESS)
+			migrate->src_pfns[i] &= ~HMM_PFN_MIGRATE;
+	}
+}
+
+/*
+ * hmm_migrate_remove_migration_pte() - restore CPU page table entry
+ * @migrate: migrate struct containing all migration information
+ *
+ * This replaces the special migration pte entry with either a mapping to the
+ * new page if migration was successful for that page, or to the original page
+ * otherwise.
+ *
+ * This also unlocks the pages and puts them back on the lru, or drops the
+ * extra ref for device pages.
+ */
+static void hmm_migrate_remove_migration_pte(struct hmm_migrate *migrate)
+{
+	unsigned long addr = migrate->start, i = 0;
+
+	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
+		struct page *newpage = hmm_pfn_to_page(migrate->dst_pfns[i]);
+		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+
+		if (!page)
+			continue;
+		newpage = newpage ? newpage : page;
+
+		remove_migration_ptes(page, newpage, false);
+		unlock_page(page);
+		migrate->npages--;
+
+		if (is_zone_device_page(page))
+			put_page(page);
+		else
+			putback_lru_page(page);
+
+		if (newpage != page) {
+			unlock_page(newpage);
+			if (is_zone_device_page(newpage))
+				put_page(newpage);
+			else
+				putback_lru_page(newpage);
+		}
+	}
+}
+
+/*
+ * hmm_vma_migrate() - migrate a range of memory inside vma using accel copy
+ *
+ * @ops: migration callback for allocating destination memory and copying
+ * @vma: virtual memory area containing the range to be migrated
+ * @src_pfns: array of hmm_pfn_t containing source pfns
+ * @dst_pfns: array of hmm_pfn_t containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callbacks
+ * Returns: 0 on success, error code otherwise
+ *
+ * This will try to migrate a range of memory using callbacks to allocate and
+ * copy memory from source to destination. This function will first collect,
+ * lock and unmap pages in the range and then call the alloc_and_copy()
+ * callback for the device driver to allocate destination memory and copy from
+ * the source.
+ *
+ * Then it will proceed and try to effectively migrate the page (struct page
+ * metadata), a step that can fail for various reasons. Before updating the CPU
+ * page table it will call the finalize_and_map() callback so that the device
+ * driver can inspect what has been successfully migrated and update its own
+ * page table (this latter aspect is not mandatory and only makes sense for
+ * some users of this API).
+ *
+ * Finally the function updates the CPU page table and unlocks the pages before
+ * returning 0.
+ *
+ * It will return an error code only if one of the arguments is invalid.
+ */
+int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
+		    struct vm_area_struct *vma,
+		    hmm_pfn_t *src_pfns,
+		    hmm_pfn_t *dst_pfns,
+		    unsigned long start,
+		    unsigned long end,
+		    void *private)
+{
+	struct hmm_migrate migrate;
+
+	/* Sanity check the arguments */
+	start &= PAGE_MASK;
+	end &= PAGE_MASK;
+	if (!vma || !ops || !src_pfns || !dst_pfns || start >= end)
+		return -EINVAL;
+	if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL))
+		return -EINVAL;
+	if (start < vma->vm_start || start >= vma->vm_end)
+		return -EINVAL;
+	if (end <= vma->vm_start || end > vma->vm_end)
+		return -EINVAL;
+
+	memset(src_pfns, 0, sizeof(*src_pfns) * ((end - start) >> PAGE_SHIFT));
+	migrate.src_pfns = src_pfns;
+	migrate.dst_pfns = dst_pfns;
+	migrate.start = start;
+	migrate.npages = 0;
+	migrate.end = end;
+	migrate.vma = vma;
+
+	/* Collect, and try to unmap source pages */
+	hmm_migrate_collect(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Lock and isolate page */
+	hmm_migrate_lock_and_isolate(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/* Unmap pages */
+	hmm_migrate_unmap(&migrate);
+	if (!migrate.npages)
+		return 0;
+
+	/*
+	 * At this point pages are locked and unmapped and thus they have
+	 * stable content and can safely be copied to destination memory that
+	 * is allocated by the callback.
+	 *
+	 * Note that migration can fail in hmm_migrate_struct_page() for each
+	 * individual page.
+	 */
+	ops->alloc_and_copy(vma, src_pfns, dst_pfns, start, end, private);
+
+	/* This does the real migration of struct page */
+	hmm_migrate_struct_page(&migrate);
+
+	ops->finalize_and_map(vma, src_pfns, dst_pfns, start, end, private);
+
+	/* Unlock and remap pages */
+	hmm_migrate_remove_migration_pte(&migrate);
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_vma_migrate);
+#endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 14/16] mm/hmm/migrate: optimize page map once in vma being migrated
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (12 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

The common case for migration of a virtual address range is that pages
are mapped only once, inside the vma in which migration is taking
place. Because we already walk the CPU page table for that range we can
directly do the unmap there and set up the special migration swap
entry.
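
Conceptually, the fast path added to the collection walk boils down to the
following sketch (illustration only, the hunk below is the real thing and also
handles soft-dirty and the unmapped count):

	/* Page mapped once and trylock succeeded: unmap it right now. */
	if (trylock_page(page)) {
		swp_entry_t entry = make_migration_entry(page, pte_write(pte));

		ptep_get_and_clear(mm, addr, ptep);
		set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));
		/* page stays pinned by the get_page() done during collection */
		page_remove_rmap(page, false);
		put_page(page);
	}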

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 mm/migrate.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 170 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 5ebd3c5..39dad11 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2116,6 +2116,7 @@ static int hmm_collect_walk_pmd(pmd_t *pmdp,
 	struct hmm_migrate *migrate = walk->private;
 	struct mm_struct *mm = walk->vma->vm_mm;
 	unsigned long addr = start;
+	unsigned long unmapped = 0;
 	hmm_pfn_t *src_pfns;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -2130,6 +2131,7 @@ again:
 
 	src_pfns = &migrate->src_pfns[(addr - migrate->start) >> PAGE_SHIFT];
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
 
 	for (; addr < end; addr += PAGE_SIZE, src_pfns++, ptep++) {
 		unsigned long pfn;
@@ -2194,9 +2196,44 @@ again:
 		 * can't be drop from it).
 		 */
 		get_page(page);
+
+		/*
+		 * Optimize for the common case where the page is only mapped
+		 * once in one process. If we can lock the page then we can
+		 * safely set up the special migration page table entry now.
+		 */
+		if (!trylock_page(page)) {
+			set_pte_at(mm, addr, ptep, pte);
+		} else {
+			pte_t swp_pte;
+
+			*src_pfns |= HMM_PFN_LOCKED;
+			ptep_get_and_clear(mm, addr, ptep);
+
+			/* Setup special migration page table entry */
+			entry = make_migration_entry(page, write);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pte))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, addr, ptep, swp_pte);
+
+			/*
+			 * This is like a regular unmap: we remove the rmap and
+			 * drop the page refcount. The page won't be freed as
+			 * we took a reference just above.
+			 */
+			page_remove_rmap(page, false);
+			put_page(page);
+			unmapped++;
+		}
 	}
+	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
 
+	/* Only flush the TLB if we actually modified any entries */
+	if (unmapped)
+		flush_tlb_range(walk->vma, start, end);
+
 	return 0;
 }
 
@@ -2279,18 +2316,26 @@ static inline bool hmm_migrate_page_check(struct page *page)
 static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
 {
 	unsigned long addr = migrate->start, i = 0;
+	struct mm_struct *mm = migrate->vma->vm_mm;
+	struct vm_area_struct *vma = migrate->vma;
+	hmm_pfn_t *src_pfns = migrate->src_pfns;
+	unsigned long restore = 0;
 	bool allow_drain = true;
 
 	lru_add_drain();
 
 	for (; (addr<migrate->end) && migrate->npages; addr+=PAGE_SIZE, i++) {
-		struct page *page = hmm_pfn_to_page(migrate->src_pfns[i]);
+		struct page *page = hmm_pfn_to_page(src_pfns[i]);
+		bool need_restore = true;
 
 		if (!page)
 			continue;
 
-		lock_page(page);
-		migrate->src_pfns[i] |= HMM_PFN_LOCKED;
+		if (!(src_pfns[i] & HMM_PFN_LOCKED)) {
+			lock_page(page);
+			need_restore = false;
+			src_pfns[i] |= HMM_PFN_LOCKED;
+		}
 
 		/* ZONE_DEVICE page are not on LRU */
 		if (!is_zone_device_page(page)) {
@@ -2301,20 +2346,135 @@ static void hmm_migrate_lock_and_isolate(struct hmm_migrate *migrate)
 			}
 
 			if (isolate_lru_page(page)) {
-				migrate->src_pfns[i] = 0;
-				migrate->npages--;
-				unlock_page(page);
-				put_page(page);
+				if (need_restore) {
+					src_pfns[i] &= ~HMM_PFN_MIGRATE;
+					restore++;
+				} else {
+					migrate->npages--;
+					unlock_page(page);
+					src_pfns[i] = 0;
+					put_page(page);
+				}
 			} else
 				/* Drop the reference we took in collect */
 				put_page(page);
 		}
 
 		if (!hmm_migrate_page_check(page)) {
-			migrate->src_pfns[i] = 0;
-			migrate->npages--;
+			if (need_restore) {
+				src_pfns[i] &= ~HMM_PFN_MIGRATE;
+				restore++;
+			} else {
+				migrate->npages--;
+				unlock_page(page);
+				src_pfns[i] = 0;
+				put_page(page);
+			}
+		}
+	}
+
+	if (!restore)
+		return;
+
+	for (addr = migrate->start, i = 0; (addr < migrate->end) && restore;) {
+		struct page *page = hmm_pfn_to_page(src_pfns[i]);
+		unsigned long next, restart;
+		spinlock_t *ptl;
+		pgd_t *pgdp;
+		pud_t *pudp;
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		if (!page || !(src_pfns[i] & HMM_PFN_MIGRATE)) {
+			addr += PAGE_SIZE;
+			i++;
+			continue;
+		}
+
+		restart = addr;
+
+		/*
+		 * Someone might have zapped the mapping. Truncate should be the
+		 * only case for which this might happen while holding mmap_sem.
+		 */
+		pgdp = pgd_offset(mm, addr);
+		next = pgd_addr_end(addr, migrate->end);
+		if (!pgdp || pgd_none_or_clear_bad(pgdp))
+			goto unlock_release;
+		pudp = pud_offset(pgdp, addr);
+		next = pud_addr_end(addr, migrate->end);
+		if (!pudp || pud_none(*pudp))
+			goto unlock_release;
+		pmdp = pmd_offset(pudp, addr);
+		next = pmd_addr_end(addr, migrate->end);
+		if (!pmdp || pmd_none(*pmdp) || pmd_trans_huge(*pmdp))
+			goto unlock_release;
+
+		ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+		for (; addr < next; addr += PAGE_SIZE, i++, ptep++) {
+			swp_entry_t entry;
+			bool write;
+			pte_t pte;
+
+			page = hmm_pfn_to_page(src_pfns[i]);
+			if (!page || (src_pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			write = src_pfns[i] & HMM_PFN_WRITE;
+			write &= (vma->vm_flags & VM_WRITE);
+
+			/* Here it means pte must be a valid migration entry */
+			pte = ptep_get_and_clear(mm, addr, ptep);
+			if (pte_none(pte) || pte_present(pte)) {
+				/* SOMETHING BAD IS GOING ON ! */
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+			entry = pte_to_swp_entry(pte);
+			if (!is_migration_entry(entry)) {
+				/* SOMETHING BAD IS GOING ON ! */
+				set_pte_at(mm, addr, ptep, pte);
+				continue;
+			}
+
+			if (is_zone_device_page(page) &&
+			    !is_addressable_page(page)) {
+				entry = make_device_entry(page, write);
+				pte = swp_entry_to_pte(entry);
+			} else {
+				pte = mk_pte(page, vma->vm_page_prot);
+				pte = pte_mkold(pte);
+				if (write)
+					pte = pte_mkwrite(pte);
+			}
+			if (pte_swp_soft_dirty(*ptep))
+				pte = pte_mksoft_dirty(pte);
+
+			get_page(page);
+			set_pte_at(mm, addr, ptep, pte);
+			if (PageAnon(page))
+				page_add_anon_rmap(page, vma, addr, false);
+			else
+				page_add_file_rmap(page, false);
+		}
+		pte_unmap_unlock(ptep - 1, ptl);
+
+unlock_release:
+		addr = restart;
+		i = (addr - migrate->start) >> PAGE_SHIFT;
+		for (; addr < next && restore; addr += PAGE_SIZE, i++) {
+			page = hmm_pfn_to_page(src_pfns[i]);
+			if (!page || (src_pfns[i] & HMM_PFN_MIGRATE))
+				continue;
+
+			src_pfns[i] = 0;
 			unlock_page(page);
-			put_page(page);
+			restore--;
+
+			if (is_zone_device_page(page))
+				put_page(page);
+			else
+				putback_lru_page(page);
 		}
 	}
 }
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (13 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  2016-12-08 16:39 ` [HMM v14 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduces a simple struct and associated helpers for device
drivers to use when hotplugging un-addressable device memory as
ZONE_DEVICE. It will find an unused physical address range and trigger
memory hotplug for it, which allocates and initializes struct pages for
the device memory.
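
A minimal sketch of the intended use (hypothetical names foo, pdev and
FOO_DEVICE_MEMORY_SIZE, not part of this patch):

	static void foo_devmem_free(struct hmm_devmem *devmem, struct page *page)
	{
		/* give the page back to the driver's own allocator */
	}

	static int foo_devmem_fault(struct hmm_devmem *devmem,
				    struct vm_area_struct *vma,
				    unsigned long addr,
				    struct page *page,
				    unsigned flags,
				    pmd_t *pmdp)
	{
		/* migrate the page back to system memory, for instance with
		 * hmm_devmem_fault_range() */
		return 0;
	}

	static const struct hmm_devmem_ops foo_devmem_ops = {
		.free	= foo_devmem_free,
		.fault	= foo_devmem_fault,
	};

	/* at probe time, devmem embedded in the driver's private struct */
	ret = hmm_devmem_add(&foo->devmem, &foo_devmem_ops, &pdev->dev,
			     FOO_DEVICE_MEMORY_SIZE);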

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 116 ++++++++++++++++++++++++
 mm/Kconfig          |   7 ++
 mm/hmm.c            | 250 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 373 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b1de4e1..674aa79 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,6 +76,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/memremap.h>
+#include <linux/completion.h>
+
+
 struct hmm;
 
 /*
@@ -377,6 +381,118 @@ int hmm_vma_migrate(const struct hmm_migrate_ops *ops,
 #endif /* IS_ENABLED(CONFIG_HMM_MIGRATE) */
 
 
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct hmm_devmem;
+
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr);
+
+/*
+ * struct hmm_devmem_ops - callback for ZONE_DEVICE memory events
+ *
+ * @free: called when the refcount on a page reaches 1 and the page is thus no
+ *        longer in use
+ * @fault: called when there is a page fault to un-addressable memory
+ */
+struct hmm_devmem_ops {
+	void (*free)(struct hmm_devmem *devmem, struct page *page);
+	int (*fault)(struct hmm_devmem *devmem,
+		     struct vm_area_struct *vma,
+		     unsigned long addr,
+		     struct page *page,
+		     unsigned flags,
+		     pmd_t *pmdp);
+};
+
+/*
+ * struct hmm_devmem - track device memory
+ *
+ * @completion: completion object for device memory
+ * @pfn_first: first pfn for this resource (set by hmm_devmem_add())
+ * @pfn_last: last pfn for this resource (set by hmm_devmem_add())
+ * @resource: IO resource reserved for this chunk of memory
+ * @pagemap: device page map for that chunk
+ * @device: device to bind resource to
+ * @ops: memory operations callback
+ * @ref: per CPU refcount
+ * @inuse: is struct in use
+ *
+ * This is a helper structure for device drivers that do not wish to deal with
+ * the gory details of hotplugging new memory and allocating struct pages.
+ *
+ * Device drivers can directly use ZONE_DEVICE memory on their own if they
+ * wish to do so.
+ */
+struct hmm_devmem {
+	struct completion		completion;
+	unsigned long			pfn_first;
+	unsigned long			pfn_last;
+	struct resource			*resource;
+	struct dev_pagemap		*pagemap;
+	struct device			*device;
+	const struct hmm_devmem_ops	*ops;
+	struct percpu_ref		ref;
+	bool				inuse;
+};
+
+/*
+ * To add (hotplug) device memory, it assumes that there is no real resource
+ * reserving a range in the physical address space (this is intended to be
+ * used by un-addressable device memory). It will reserve a physical range big
+ * enough and allocate struct pages for it.
+ *
+ * A device driver can wrap the hmm_devmem struct inside a private device
+ * driver struct. The device driver must call hmm_devmem_remove() before the
+ * device goes away and before freeing the hmm_devmem struct memory.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size);
+bool hmm_devmem_remove(struct hmm_devmem *devmem);
+
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   hmm_pfn_t *src_pfns,
+			   hmm_pfn_t *dst_pfns,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private);
+
+/*
+ * hmm_devmem_page_set_drvdata - set per page driver data field
+ *
+ * @page: pointer to struct page
+ * @data: driver data value to set
+ *
+ * Because the page can not be on the lru we have an unsigned long that the
+ * driver can use to store a per-page field. This is just a simple helper to
+ * do that.
+ */
+static inline void hmm_devmem_page_set_drvdata(struct page *page,
+					       unsigned long data)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	drvdata[1] = data;
+}
+
+/*
+ * hmm_devmem_page_get_drvdata - get per page driver data field
+ *
+ * @page: pointer to struct page
+ * Return: driver data value
+ */
+static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
+{
+	unsigned long *drvdata = (unsigned long *)&page->pgmap;
+
+	return drvdata[1];
+}
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
+
+
 /* Below are for HMM internal use only ! Not to be used by device driver ! */
 void hmm_mm_destroy(struct mm_struct *mm);
 
diff --git a/mm/Kconfig b/mm/Kconfig
index dd091da..e1bb33d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -321,6 +321,13 @@ config HMM_MIGRATE
 	  migration of ZONE_DEVICE pages that have the
 	  MEMORY_DEVICE_ALLOW_MIGRATE flag set.
 
+config HMM_DEVMEM
+	bool "HMM device memory helpers (to leverage ZONE_DEVICE)"
+	select HMM
+	help
+	  HMM devmem is a set of helpers to leverage the new ZONE_DEVICE
+	  feature. This is just to avoid device drivers having to replicate
+	  boilerplate code.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/hmm.c b/mm/hmm.c
index a397d45..4d3b399 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -23,10 +23,15 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mmzone.h>
+#include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
+#include <linux/memremap.h>
 #include <linux/mmu_notifier.h>
 
+#define SECTION_SIZE (1UL << PA_SECTION_SHIFT)
+
 
 /*
  * struct hmm - HMM per mm struct
@@ -735,3 +740,248 @@ int hmm_vma_fault(struct vm_area_struct *vma,
 }
 EXPORT_SYMBOL(hmm_vma_fault);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
+
+
+#if IS_ENABLED(CONFIG_HMM_DEVMEM)
+struct page *hmm_vma_alloc_locked_page(struct vm_area_struct *vma,
+				       unsigned long addr)
+{
+	struct page *page;
+
+	page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+	if (!page)
+		return NULL;
+	lock_page(page);
+	return page;
+}
+EXPORT_SYMBOL(hmm_vma_alloc_locked_page);
+
+
+static void hmm_devmem_release(struct percpu_ref *ref)
+{
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	complete(&devmem->completion);
+	devmem->inuse = false;
+}
+
+static void hmm_devmem_exit(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	percpu_ref_exit(ref);
+	wait_for_completion(&devmem->completion);
+	devm_remove_action(devmem->device, hmm_devmem_exit, data);
+}
+
+static void hmm_devmem_kill(void *data)
+{
+	struct percpu_ref *ref = data;
+	struct hmm_devmem *devmem;
+
+	devmem = container_of(ref, struct hmm_devmem, ref);
+	devmem->inuse = false;
+	percpu_ref_kill(ref);
+	devm_remove_action(devmem->device, hmm_devmem_kill, data);
+}
+
+static int hmm_devmem_fault(struct vm_area_struct *vma,
+			    unsigned long addr,
+			    struct page *page,
+			    unsigned flags,
+			    pmd_t *pmdp)
+{
+	struct hmm_devmem *devmem = page->pgmap->data;
+
+	return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
+}
+
+static void hmm_devmem_free(struct page *page, void *data)
+{
+	struct hmm_devmem *devmem = data;
+
+	devmem->ops->free(devmem, page);
+}
+
+/*
+ * hmm_devmem_add() - hotplug fake ZONE_DEVICE memory for device memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @ops: memory event device driver callbacks (see struct hmm_devmem_ops)
+ * @device: device struct to bind the resource to
+ * @size: size in bytes of the device memory to add
+ * Returns: 0 on success, error code otherwise
+ *
+ * This first finds an empty range of physical addresses big enough for the new
+ * resource and then hotplugs it as ZONE_DEVICE memory, allocating struct pages.
+ * It does not do anything beside that; all events affecting the memory will go
+ * through the various callbacks provided by the hmm_devmem_ops struct.
+ */
+int hmm_devmem_add(struct hmm_devmem *devmem,
+		   const struct hmm_devmem_ops *ops,
+		   struct device *device,
+		   unsigned long size)
+{
+	const struct resource *res;
+	resource_size_t addr;
+	void *ptr;
+	int ret;
+
+	init_completion(&devmem->completion);
+	devmem->pfn_first = -1UL;
+	devmem->pfn_last = -1UL;
+	devmem->resource = NULL;
+	devmem->device = device;
+	devmem->pagemap = NULL;
+	devmem->inuse = false;
+	devmem->ops = ops;
+
+	ret = percpu_ref_init(&devmem->ref, &hmm_devmem_release, 0,
+			      GFP_KERNEL);
+	if (ret)
+		return ret;
+
+	ret = devm_add_action(device, hmm_devmem_exit, &devmem->ref);
+	if (ret)
+		goto error;
+
+	size = ALIGN(size, SECTION_SIZE);
+	addr = (1UL << MAX_PHYSMEM_BITS) - size;
+
+	/*
+	 * FIXME add a new helper to quickly walk resource tree and find free
+	 * range
+	 *
+	 * FIXME what about ioport_resource resource ?
+	 */
+	for (; addr > size; addr -= size) {
+		ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
+		if (ret != REGION_DISJOINT)
+			continue;
+
+		devmem->resource = devm_request_mem_region(device, addr, size,
+							   dev_name(device));
+		if (!devmem->resource) {
+			ret = -ENOMEM;
+			goto error;
+		}
+		break;
+	}
+	if (!devmem->resource) {
+		ret = -ERANGE;
+		goto error;
+	}
+
+	ptr = devm_memremap_pages(device, devmem->resource, &devmem->ref,
+				  NULL, &devmem->pagemap,
+				  hmm_devmem_fault, hmm_devmem_free, devmem,
+				  MEMORY_DEVICE | MEMORY_DEVICE_ALLOW_MIGRATE |
+				  MEMORY_DEVICE_UNADDRESSABLE);
+	if (IS_ERR(ptr)) {
+		ret = PTR_ERR(ptr);
+		goto error;
+	}
+
+	ret = devm_add_action(device, hmm_devmem_kill, &devmem->ref);
+	if (ret) {
+		hmm_devmem_kill(&devmem->ref);
+		goto error;
+	}
+
+	res = devmem->pagemap->res;
+	devmem->pfn_first = res->start >> PAGE_SHIFT;
+	devmem->pfn_last = (resource_size(res)>>PAGE_SHIFT)+devmem->pfn_first;
+	devmem->inuse = true;
+
+	return 0;
+
+error:
+	hmm_devmem_exit(&devmem->ref);
+	return ret;
+}
+EXPORT_SYMBOL(hmm_devmem_add);
+
+/*
+ * hmm_devmem_remove() - remove device memory (kill and free ZONE_DEVICE)
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * Returns: true if device memory is no longer in use, false if still in use
+ *
+ * This will hot remove memory that was hotplugged by hmm_devmem_add() on
+ * behalf of the device driver. It will free struct pages and remove the
+ * resource that reserved the physical address range for this device memory.
+ *
+ * The device driver can not free the struct while this function returns false;
+ * it must call this function over and over until it returns true. Note that if
+ * there is a refcount bug this might never happen!
+ */
+bool hmm_devmem_remove(struct hmm_devmem *devmem)
+{
+	struct device *device = devmem->device;
+
+	hmm_devmem_kill(&devmem->ref);
+
+	if (devmem->pagemap) {
+		devm_memremap_pages_remove(device, devmem->pagemap);
+		devmem->pagemap = NULL;
+	}
+
+	hmm_devmem_exit(&devmem->ref);
+
+	/* FIXME maybe wait a bit ? */
+	if (devmem->inuse)
+		return false;
+
+	if (devmem->resource) {
+		resource_size_t size = resource_size(devmem->resource);
+
+		devm_release_mem_region(device, devmem->resource->start, size);
+		devmem->resource = NULL;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(hmm_devmem_remove);
+
+/*
+ * hmm_devmem_fault_range() - migrate back a virtual range of memory
+ *
+ * @devmem: hmm_devmem struct used to track and manage the ZONE_DEVICE memory
+ * @vma: virtual memory area containing the range to be migrated
+ * @ops: migration callback for allocating destination memory and copying
+ * @src_pfns: array of hmm_pfn_t containing source pfns
+ * @dst_pfns: array of hmm_pfn_t containing destination pfns
+ * @start: start address of the range to migrate (inclusive)
+ * @addr: fault address (must be inside the range)
+ * @end: end address of the range to migrate (exclusive)
+ * @private: pointer passed back to each of the callback
+ * Returns: 0 on success, VM_FAULT_SIGBUS on error
+ *
+ * This is a wrapper around hmm_vma_migrate() which checks the migration status
+ * for a given fault address and returns the corresponding page fault handler
+ * status, ie 0 on success or VM_FAULT_SIGBUS if migration failed for the fault
+ * address.
+ *
+ * This is a helper intended to be used by ZONE_DEVICE fault handlers.
+ */
+int hmm_devmem_fault_range(struct hmm_devmem *devmem,
+			   struct vm_area_struct *vma,
+			   const struct hmm_migrate_ops *ops,
+			   hmm_pfn_t *src_pfns,
+			   hmm_pfn_t *dst_pfns,
+			   unsigned long start,
+			   unsigned long addr,
+			   unsigned long end,
+			   void *private)
+{
+	if (hmm_vma_migrate(ops, vma, src_pfns, dst_pfns, start, end, private))
+		return VM_FAULT_SIGBUS;
+
+	if (dst_pfns[(addr - start) >> PAGE_SHIFT] & HMM_PFN_ERROR)
+		return VM_FAULT_SIGBUS;
+
+	return 0;
+}
+EXPORT_SYMBOL(hmm_devmem_fault_range);
+#endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

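To make the driver-facing flow above concrete, here is a minimal, hypothetical
driver-side sketch (not part of the patch). It assumes the hmm_devmem_ops
layout implied by hmm_devmem_fault()/hmm_devmem_free() above; all foo_* names
are made up.

#include <linux/hmm.h>
#include <linux/delay.h>

struct foo_device {
	struct hmm_devmem devmem;	/* embedded, HMM keeps pointing at it */
	struct device *dev;
	/* ... driver private state ... */
};

/* CPU touched an un-addressable device page: migrate it back, or SIGBUS. */
static int foo_devmem_fault(struct hmm_devmem *devmem,
			    struct vm_area_struct *vma, unsigned long addr,
			    struct page *page, unsigned flags, pmd_t *pmdp)
{
	/* typically ends up calling hmm_devmem_fault_range() */
	return 0;
}

/* Last reference on a device page was dropped: hand it back to the driver. */
static void foo_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* eg mark hmm_devmem_page_get_drvdata(page) as free again */
}

static const struct hmm_devmem_ops foo_devmem_ops = {
	.fault	= foo_devmem_fault,
	.free	= foo_devmem_free,
};

static int foo_register_memory(struct foo_device *foo, unsigned long size)
{
	/* picks a free physical range and creates struct pages for it */
	return hmm_devmem_add(&foo->devmem, &foo_devmem_ops, foo->dev, size);
}

static void foo_unregister_memory(struct foo_device *foo)
{
	/* foo->devmem must stay allocated until this reports success */
	while (!hmm_devmem_remove(&foo->devmem))
		msleep(100);
}
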
^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [HMM v14 16/16] mm/hmm/devmem: dummy HMM device as an helper for ZONE_DEVICE memory
  2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
                   ` (14 preceding siblings ...)
  2016-12-08 16:39 ` [HMM v14 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
@ 2016-12-08 16:39 ` Jérôme Glisse
  15 siblings, 0 replies; 23+ messages in thread
From: Jérôme Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm
  Cc: John Hubbard, Jérôme Glisse, Evgeny Baskakov,
	Mark Hairgrove, Sherry Cheung, Subhash Gutti

This introduces a dummy HMM device class so a device driver can use it to
create an hmm_device for the sole purpose of registering device memory.
It is useful to device drivers that want to manage multiple physical
device memories under the same device umbrella.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Evgeny Baskakov <ebaskakov@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Mark Hairgrove <mhairgrove@nvidia.com>
Signed-off-by: Sherry Cheung <SCheung@nvidia.com>
Signed-off-by: Subhash Gutti <sgutti@nvidia.com>
---
 include/linux/hmm.h | 22 ++++++++++++-
 mm/hmm.c            | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 674aa79..57e88e4 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -76,10 +76,10 @@
 
 #if IS_ENABLED(CONFIG_HMM)
 
+#include <linux/device.h>
 #include <linux/memremap.h>
 #include <linux/completion.h>
 
-
 struct hmm;
 
 /*
@@ -490,6 +490,26 @@ static inline unsigned long hmm_devmem_page_get_drvdata(struct page *page)
 
 	return drvdata[1];
 }
+
+
+/*
+ * struct hmm_device - fake device to hang device memory onto
+ *
+ * @device: device struct
+ * @minor: device minor number
+ */
+struct hmm_device {
+	struct device		device;
+	unsigned		minor;
+};
+
+/*
+ * Device drivers that want to handle multiple devices' memory through a single
+ * fake device can use hmm_device to do so. This is purely a helper and it
+ * is not needed to make use of any HMM functionality.
+ */
+struct hmm_device *hmm_device_new(void);
+void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4d3b399..df25810 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -24,6 +24,7 @@
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/mmzone.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
@@ -984,4 +985,98 @@ int hmm_devmem_fault_range(struct hmm_devmem *devmem,
 	return 0;
 }
 EXPORT_SYMBOL(hmm_devmem_fault_range);
+
+/*
+ * Device drivers that want to handle multiple devices' memory through a single
+ * fake device can use hmm_device to do so. This is purely a helper and it
+ * is not needed to make use of any HMM functionality.
+ */
+#define HMM_DEVICE_MAX 256
+
+static DECLARE_BITMAP(hmm_device_mask, HMM_DEVICE_MAX);
+static DEFINE_SPINLOCK(hmm_device_lock);
+static struct class *hmm_device_class;
+static dev_t hmm_device_devt;
+
+static void hmm_device_release(struct device *device)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = container_of(device, struct hmm_device, device);
+	spin_lock(&hmm_device_lock);
+	clear_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	kfree(hmm_device);
+}
+
+struct hmm_device *hmm_device_new(void)
+{
+	struct hmm_device *hmm_device;
+
+	hmm_device = kzalloc(sizeof(*hmm_device), GFP_KERNEL);
+	if (!hmm_device)
+		return ERR_PTR(-ENOMEM);
+
+	spin_lock(&hmm_device_lock);
+	hmm_device->minor = find_first_zero_bit(hmm_device_mask,
+						HMM_DEVICE_MAX);
+	if (hmm_device->minor >= HMM_DEVICE_MAX) {
+		spin_unlock(&hmm_device_lock);
+		kfree(hmm_device);
+		return ERR_PTR(-EBUSY);
+	}
+	set_bit(hmm_device->minor, hmm_device_mask);
+	spin_unlock(&hmm_device_lock);
+
+	dev_set_name(&hmm_device->device, "hmm_device%d", hmm_device->minor);
+	hmm_device->device.devt = MKDEV(MAJOR(hmm_device_devt),
+					hmm_device->minor);
+	hmm_device->device.release = hmm_device_release;
+	hmm_device->device.class = hmm_device_class;
+	device_initialize(&hmm_device->device);
+
+	return hmm_device;
+}
+EXPORT_SYMBOL(hmm_device_new);
+
+void hmm_device_put(struct hmm_device *hmm_device)
+{
+	put_device(&hmm_device->device);
+}
+EXPORT_SYMBOL(hmm_device_put);
+
+static int __init hmm_init(void)
+{
+	int ret;
+
+	ret = alloc_chrdev_region(&hmm_device_devt, 0,
+				  HMM_DEVICE_MAX,
+				  "hmm_device");
+	if (ret)
+		return ret;
+
+	hmm_device_class = class_create(THIS_MODULE, "hmm_device");
+	if (IS_ERR(hmm_device_class)) {
+		unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+		return PTR_ERR(hmm_device_class);
+	}
+	return 0;
+}
+
+static void __exit hmm_exit(void)
+{
+	unregister_chrdev_region(hmm_device_devt, HMM_DEVICE_MAX);
+	class_destroy(hmm_device_class);
+}
+
+module_init(hmm_init);
+module_exit(hmm_exit);
+MODULE_LICENSE("GPL");
 #endif /* IS_ENABLED(CONFIG_HMM_DEVMEM) */
-- 
2.4.3

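A hypothetical use of the two exported helpers above (not part of the patch;
it assumes hmm_device_new() reports failure via ERR_PTR() and reuses the
made-up foo_* names from the earlier sketch):

static struct hmm_device *foo_hmm_device;

static int foo_driver_init(void)
{
	/* one fake device to hang every physical device's memory onto */
	foo_hmm_device = hmm_device_new();
	if (IS_ERR(foo_hmm_device))
		return PTR_ERR(foo_hmm_device);

	/*
	 * Each chunk of device memory can then be registered against it,
	 * eg hmm_devmem_add(&foo->devmem, &foo_devmem_ops,
	 *		     &foo_hmm_device->device, size);
	 */
	return 0;
}

static void foo_driver_exit(void)
{
	/* drops the last reference; hmm_device_release() frees the struct */
	hmm_device_put(foo_hmm_device);
}
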
^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 16:21   ` Dave Hansen
@ 2016-12-08 16:39     ` Jerome Glisse
  2016-12-08 20:07       ` Dave Hansen
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2016-12-08 16:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

> On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
> > Architecture that wish to support un-addressable device memory should make
> > sure to never populate the kernel linar mapping for the physical range.
> 
> Does the platform somehow provide a range of physical addresses for this
> unaddressable area?  How do we know no memory will be hot-added in a
> range we're using for unaddressable device memory, for instance?

That's one of the big issues. No, the platform does not reserve any range, so
there is a possibility that some memory gets hotplugged and assigned this range.

I pushed the range decision to a higher level (ie it is the device driver that
picks one), so right now for a device driver using HMM (the NVidia closed
driver, as we don't have nouveau ready for that yet) it goes from the highest
physical address and scans down until it finds an empty range big enough.

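(In code, the scan described above is essentially the range-picking loop of
hmm_devmem_add() in patch 15; a condensed, illustration-only restatement, the
helper name is made up:)

static int hmm_pick_device_range(struct hmm_devmem *devmem,
				 struct device *device, unsigned long size)
{
	resource_size_t addr;

	size = ALIGN(size, SECTION_SIZE);
	/* start just below the architectural limit and walk down */
	for (addr = (1UL << MAX_PHYSMEM_BITS) - size; addr > size; addr -= size) {
		if (region_intersects(addr, size, 0,
				      IORES_DESC_NONE) != REGION_DISJOINT)
			continue;
		devmem->resource = devm_request_mem_region(device, addr, size,
							   dev_name(device));
		return devmem->resource ? 0 : -ENOMEM;
	}
	return -ERANGE;	/* no free range below the limit */
}
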
I don't think i can control or enforce at the platform level how specific
physical addresses are chosen for hotplug.

So right now with my patchset what happens is that the hotplug will fail
because i already registered a resource for the physical range. What i can
add is a way to migrate the device memory to a different physical range.
I am a bit afraid of how complex this can be.

The ideal solution would be to increase MAX_PHYSMEM_BITS by one and use
physical addresses that can never be valid. We would not need to increase the
direct mapping size of memory (this memory is not mappable by the CPU). But
i am afraid of the complications this might cause.

I think for the sparse memory model it should be easy enough and i already
rely on sparsemem for HMM.

In any case i think this is something that can be solved later, if it becomes
a real issue. Maybe i should add a debug printk for when hotplug fails
because of an existing un-addressable ZONE_DEVICE resource.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 16:39     ` Jerome Glisse
@ 2016-12-08 20:07       ` Dave Hansen
  2016-12-08 20:37         ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Hansen @ 2016-12-08 20:07 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 12/08/2016 08:39 AM, Jerome Glisse wrote:
>> On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
>>> > > Architecture that wish to support un-addressable device memory should make
>>> > > sure to never populate the kernel linar mapping for the physical range.
>> > 
>> > Does the platform somehow provide a range of physical addresses for this
>> > unaddressable area?  How do we know no memory will be hot-added in a
>> > range we're using for unaddressable device memory, for instance?
> That's what one of the big issue. No platform does not reserve any range so
> there is a possibility that some memory get hotpluged and assign this range.
> 
> I pushed the range decision to higher level (ie it is the device driver that
> pick one) so right now for device driver using HMM (NVidia close driver as
> we don't have nouveau ready for that yet) it goes from the highest physical
> address and scan down until finding an empty range big enough.

I don't think you should be stealing physical address space for things
that don't and can't have physical addresses.  Delegating this to
individual device drivers and hoping that they all get it right seems
like a recipe for disaster.

Maybe worth adding to the changelog:

	This feature potentially breaks memory hotplug unless every
	driver using it magically predicts the future addresses of
	where memory will be hotplugged.

BTW, how many more of these "big issues" does this set have?  I didn't
see any mention of this in the changelogs.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 20:07       ` Dave Hansen
@ 2016-12-08 20:37         ` Jerome Glisse
  2016-12-26  9:12           ` Anshuman Khandual
  0 siblings, 1 reply; 23+ messages in thread
From: Jerome Glisse @ 2016-12-08 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

> On 12/08/2016 08:39 AM, Jerome Glisse wrote:
> >> On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
> >>> > > Architecture that wish to support un-addressable device memory should
> >>> > > make
> >>> > > sure to never populate the kernel linar mapping for the physical
> >>> > > range.
> >> > 
> >> > Does the platform somehow provide a range of physical addresses for this
> >> > unaddressable area?  How do we know no memory will be hot-added in a
> >> > range we're using for unaddressable device memory, for instance?
> > That's what one of the big issue. No platform does not reserve any range so
> > there is a possibility that some memory get hotpluged and assign this
> > range.
> > 
> > I pushed the range decision to higher level (ie it is the device driver
> > that
> > pick one) so right now for device driver using HMM (NVidia close driver as
> > we don't have nouveau ready for that yet) it goes from the highest physical
> > address and scan down until finding an empty range big enough.
> 
> I don't think you should be stealing physical address space for things
> that don't and can't have physical addresses.  Delegating this to
> individual device drivers and hoping that they all get it right seems
> like a recipe for disaster.

Well i expected device drivers to use hmm_devmem_add(), which does not take a
physical address but uses the above logic to pick one.

> 
> Maybe worth adding to the changelog:
> 
> 	This feature potentially breaks memory hotplug unless every
> 	driver using it magically predicts the future addresses of
> 	where memory will be hotplugged.

I will add a debug printk to memory hotplug in case it fails because of some
un-addressable resource. If you really dislike memory hotplug being broken
then i can go down the path of allowing memory to be hotplugged above the max
physical memory limit. This requires more changes but i believe it is doable
for some of the memory models (sparsemem and sparsemem extreme).

> 
> BTW, how many more of these "big issues" does this set have?  I didn't
> see any mention of this in the changelogs.
 
I am not sure what to say here. If you don't use HMM, ie there is no device
that hotplugs such memory, then there is no chance of having an issue. If you
have a device that uses it then someone might try to do something stupid (try
to kmap and access such an un-addressable page for instance). So i am not sure
where to draw the line.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-08 20:37         ` Jerome Glisse
@ 2016-12-26  9:12           ` Anshuman Khandual
  2016-12-26 19:02             ` Jerome Glisse
  0 siblings, 1 reply; 23+ messages in thread
From: Anshuman Khandual @ 2016-12-26  9:12 UTC (permalink / raw)
  To: Jerome Glisse, Dave Hansen
  Cc: akpm, linux-kernel, linux-mm, John Hubbard, Dan Williams, Ross Zwisler

On 12/09/2016 02:07 AM, Jerome Glisse wrote:
>> On 12/08/2016 08:39 AM, Jerome Glisse wrote:
>>>> > >> On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
>>>>>>> > >>> > > Architecture that wish to support un-addressable device memory should
>>>>>>> > >>> > > make
>>>>>>> > >>> > > sure to never populate the kernel linar mapping for the physical
>>>>>>> > >>> > > range.
>>>>> > >> > 
>>>>> > >> > Does the platform somehow provide a range of physical addresses for this
>>>>> > >> > unaddressable area?  How do we know no memory will be hot-added in a
>>>>> > >> > range we're using for unaddressable device memory, for instance?
>>> > > That's what one of the big issue. No platform does not reserve any range so
>>> > > there is a possibility that some memory get hotpluged and assign this
>>> > > range.
>>> > > 
>>> > > I pushed the range decision to higher level (ie it is the device driver
>>> > > that
>>> > > pick one) so right now for device driver using HMM (NVidia close driver as
>>> > > we don't have nouveau ready for that yet) it goes from the highest physical
>>> > > address and scan down until finding an empty range big enough.
>> > 
>> > I don't think you should be stealing physical address space for things
>> > that don't and can't have physical addresses.  Delegating this to
>> > individual device drivers and hoping that they all get it right seems
>> > like a recipe for disaster.
> Well i expected device driver to use hmm_devmem_add() which does not take
> physical address but use the above logic to pick one.
> 
>> > 
>> > Maybe worth adding to the changelog:
>> > 
>> > 	This feature potentially breaks memory hotplug unless every
>> > 	driver using it magically predicts the future addresses of
>> > 	where memory will be hotplugged.
> I will add debug printk to memory hotplug in case it fails because of some
> un-addressable resource. If you really dislike memory hotplug being broken
> then i can go down the way of allowing to hotplug memory above the max
> physical memory limit. This require more changes but i believe this is
> doable for some of the memory model (sparsemem and sparsemem extreme).

Did not get that. Hotplug memory requests will come within the max physical
memory limit as they are real RAM. The address range also would have been
specified. How can it be added beyond the physical limit, irrespective of
which memory model we use?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory
  2016-12-26  9:12           ` Anshuman Khandual
@ 2016-12-26 19:02             ` Jerome Glisse
  0 siblings, 0 replies; 23+ messages in thread
From: Jerome Glisse @ 2016-12-26 19:02 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Dave Hansen, akpm, linux-kernel, linux-mm, John Hubbard,
	Dan Williams, Ross Zwisler

> On 12/09/2016 02:07 AM, Jerome Glisse wrote:
> >> On 12/08/2016 08:39 AM, Jerome Glisse wrote:
> >>>> > >> On 12/08/2016 08:39 AM, Jérôme Glisse wrote:
> >>>>>>> > >>> > > Architecture that wish to support un-addressable device
> >>>>>>> > >>> > > memory should
> >>>>>>> > >>> > > make
> >>>>>>> > >>> > > sure to never populate the kernel linar mapping for the
> >>>>>>> > >>> > > physical
> >>>>>>> > >>> > > range.
> >>>>> > >> > 
> >>>>> > >> > Does the platform somehow provide a range of physical addresses
> >>>>> > >> > for this
> >>>>> > >> > unaddressable area?  How do we know no memory will be hot-added
> >>>>> > >> > in a
> >>>>> > >> > range we're using for unaddressable device memory, for instance?
> >>> > > That's what one of the big issue. No platform does not reserve any
> >>> > > range so
> >>> > > there is a possibility that some memory get hotpluged and assign this
> >>> > > range.
> >>> > > 
> >>> > > I pushed the range decision to higher level (ie it is the device
> >>> > > driver
> >>> > > that
> >>> > > pick one) so right now for device driver using HMM (NVidia close
> >>> > > driver as
> >>> > > we don't have nouveau ready for that yet) it goes from the highest
> >>> > > physical
> >>> > > address and scan down until finding an empty range big enough.
> >> > 
> >> > I don't think you should be stealing physical address space for things
> >> > that don't and can't have physical addresses.  Delegating this to
> >> > individual device drivers and hoping that they all get it right seems
> >> > like a recipe for disaster.
> > Well i expected device driver to use hmm_devmem_add() which does not take
> > physical address but use the above logic to pick one.
> > 
> >> > 
> >> > Maybe worth adding to the changelog:
> >> > 
> >> > 	This feature potentially breaks memory hotplug unless every
> >> > 	driver using it magically predicts the future addresses of
> >> > 	where memory will be hotplugged.
> > I will add debug printk to memory hotplug in case it fails because of some
> > un-addressable resource. If you really dislike memory hotplug being broken
> > then i can go down the way of allowing to hotplug memory above the max
> > physical memory limit. This require more changes but i believe this is
> > doable for some of the memory model (sparsemem and sparsemem extreme).
> 
> Did not get that. Hotplug memory request will come within the max physical
> memory limit as they are real RAM. The address range also would have been
> specified. How it can be added beyond the physical limit irrespective of
> which we memory model we use.
> 

Maybe what you do not know is that on x86 we do not have a resource reserved by
the platform for the device memory (the PCIE BAR never covers the whole device
memory so this range can not be used).

Right now i pick a random unused physical address range for device memory and
thus real memory might later be hotplugged right inside the range i took, and
that hotplug will fail because i already registered a resource for my device
memory. This is an x86 platform limitation.

Now if i bump the maximum physical memory by one bit then i can hotplug device
memory inside that extra bit range and be sure that i will never have any real
memory conflict (as i am above the architectural limit).

Bumping the maximum physical memory has implications and i can not just bump
MAX_PHYSMEM_BITS as it will have repercussions that i don't want. However, in
some memory models i can allow hotplug to happen above MAX_PHYSMEM_BITS without
having to change MAX_PHYSMEM_BITS itself, and allow page_to_pfn() and
pfn_to_page() to work above MAX_PHYSMEM_BITS, again without changing it.

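(Illustration only, assuming the x86-64 sparsemem value of MAX_PHYSMEM_BITS ==
46 that was current at the time; none of these names exist in the patchset:)

/* "One extra bit" carves out a window no RAM hotplug can ever claim. */
#define HMM_UNADDR_PHYS_BASE	(1UL << MAX_PHYSMEM_BITS)	/* 64TB */
#define HMM_UNADDR_PHYS_END	(1UL << (MAX_PHYSMEM_BITS + 1))	/* 128TB */
/*
 * Un-addressable device pages would then get pfns in
 * [HMM_UNADDR_PHYS_BASE >> PAGE_SHIFT, HMM_UNADDR_PHYS_END >> PAGE_SHIFT),
 * at the cost of making pfn_to_page()/page_to_pfn() cope with pfns above
 * the architectural limit.
 */
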
Memory models like SPARSEMEM_VMEMMAP are problematic as i would need to change
the kernel virtual memory map for the architecture and that is not something i
want to do.

In the meantime people using HMM are "~happy~" enough with memory hotplug failing.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-12-26 19:02 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-08 16:39 [HMM v14 00/16] HMM (Heterogeneous Memory Management) v14 Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 01/16] mm/free_hot_cold_page: catch ZONE_DEVICE pages Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 02/16] mm/memory/hotplug: convert device bool to int to allow for more flags v2 Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 03/16] mm/ZONE_DEVICE/devmem_pages_remove: allow early removal of device memory Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 04/16] mm/ZONE_DEVICE/free-page: callback when page is freed Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 05/16] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory Jérôme Glisse
2016-12-08 16:21   ` Dave Hansen
2016-12-08 16:39     ` Jerome Glisse
2016-12-08 20:07       ` Dave Hansen
2016-12-08 20:37         ` Jerome Glisse
2016-12-26  9:12           ` Anshuman Khandual
2016-12-26 19:02             ` Jerome Glisse
2016-12-08 16:39 ` [HMM v14 06/16] mm/ZONE_DEVICE/x86: " Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 07/16] mm/hmm: heterogeneous memory management (HMM for short) Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 08/16] mm/hmm/mirror: mirror process address space on device with HMM helpers Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 09/16] mm/hmm/mirror: helper to snapshot CPU page table Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 10/16] mm/hmm/mirror: device page fault handler Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 11/16] mm/hmm/migrate: support un-addressable ZONE_DEVICE page in migration Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 12/16] mm/hmm/migrate: add new boolean copy flag to migratepage() callback Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 13/16] mm/hmm/migrate: new memory migration helper for use with device memory v2 Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 14/16] mm/hmm/migrate: optimize page map once in vma being migrated Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 15/16] mm/hmm/devmem: device driver helper to hotplug ZONE_DEVICE memory Jérôme Glisse
2016-12-08 16:39 ` [HMM v14 16/16] mm/hmm/devmem: dummy HMM device as an helper for " Jérôme Glisse
