* [RFC PATCH 0/7] 'struct page' driver for persistent memory
@ 2015-08-13  3:50 Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, Dave Hansen, david, mingo, linux-mm,
	Ingo Molnar, mgorman, H. Peter Anvin, ross.zwisler, torvalds,
	hch

When we last left this debate [1] it was becoming clear that the
'page-less' approach left too many I/O scenarios off the table.  The
page-less enabling is still useful for avoiding the overhead of struct
page where it is not needed, but in the end, page-backed persistent
memory seems to be a requirement.

With that assumption in place, the next debate was where to allocate the
storage for the memmap array, or whether to otherwise reduce the overhead
of 'struct page' with a fancier object like variable-length pages.

This series takes the position of mapping persistent memory with
standard 'struct page' and pushes the policy decision of allocating the
storage for the memmap array, from RAM or PMEM, to userspace.  It turns
out the best place to allocate 64 bytes per 4K page will be
platform-specific.
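
To put a number on that overhead: at 64 bytes of 'struct page' per 4K
page the memmap consumes 1/64th (~1.6%) of the mapped capacity:

	64 / 4096 = 1/64th of capacity
	1TB of PMEM => ~16GB of memmap

A memmap that large may be unreasonable to carve out of RAM on a
smaller machine, hence the option to place it in PMEM itself.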

If PMEM capacities are low then mapping in RAM is a good choice.
Otherwise, for very large capacities, storing the memmap in PMEM might
be a better choice.  Then again, PMEM might not have the performance
characteristics favorable to a high-rate-of-change object like 'struct
page'.  The kernel can make a reasonable guess, but it seems we will
need to maintain the ability to override any default.

Outside of the new libnvdimm sysfs mechanisms to specify the memmap
allocation policy for a given PMEM device, the core of this
implementation is 'struct vmem_altmap'.  'vmem_altmap' alters the memory
hotplug code to optionally use a reserved PMEM-pfn range, rather than
dynamic allocation, for the memmap.
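
For reference, the fields of 'struct vmem_altmap' as introduced in
patch 3 (comments paraphrased from that patch):

	struct vmem_altmap {
		const unsigned long base_pfn;	/* first pfn of the range */
		const unsigned long reserve;	/* pfns reserved by the device */
		unsigned long free;	/* pfns available for memmap storage */
		unsigned long alloc;	/* pfns consumed, private to
					   vmemmap_populate() */
	};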

This is only lightly tested so far, confirming valid pfn_to_page() and
page_address() conversions across a range of persistent memory specified
by 'memmap=ss!nn' (a kernel command line option to simulate a PMEM
range).
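
For example, booting with memmap=4G!8G simulates a 4G PMEM range
starting at the 8G physical offset (size!start, per
Documentation/kernel-parameters.txt).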

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000748.html

---

Dan Williams (7):
      x86, mm: ZONE_DEVICE for "device memory"
      x86, mm: introduce struct vmem_altmap
      x86, mm: arch_add_dev_memory()
      mm: register_dev_memmap()
      libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
      libnvdimm, pfn: 'struct page' provider infrastructure
      libnvdimm, pmem: 'struct page' for pmem


 arch/powerpc/mm/init_64.c         |    7 +
 arch/x86/Kconfig                  |   19 ++
 arch/x86/include/uapi/asm/e820.h  |    2 
 arch/x86/kernel/Makefile          |    2 
 arch/x86/kernel/pmem.c            |   79 +--------
 arch/x86/mm/init_64.c             |  160 +++++++++++++-----
 drivers/nvdimm/Kconfig            |   26 +++
 drivers/nvdimm/Makefile           |    5 +
 drivers/nvdimm/btt.c              |    8 -
 drivers/nvdimm/btt_devs.c         |  172 +------------------
 drivers/nvdimm/claim.c            |  201 ++++++++++++++++++++++
 drivers/nvdimm/e820.c             |   86 ++++++++++
 drivers/nvdimm/namespace_devs.c   |   34 +++-
 drivers/nvdimm/nd-core.h          |    9 +
 drivers/nvdimm/nd.h               |   59 ++++++-
 drivers/nvdimm/pfn.h              |   35 ++++
 drivers/nvdimm/pfn_devs.c         |  334 +++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/pmem.c             |  213 +++++++++++++++++++++++-
 drivers/nvdimm/region.c           |    2 
 drivers/nvdimm/region_devs.c      |   19 ++
 include/linux/kmap_pfn.h          |   33 ++++
 include/linux/memory_hotplug.h    |   21 ++
 include/linux/mm.h                |   53 ++++++
 include/linux/mmzone.h            |   23 +++
 mm/kmap_pfn.c                     |  195 ++++++++++++++++++++++
 mm/memory_hotplug.c               |   84 ++++++---
 mm/page_alloc.c                   |   18 ++
 mm/sparse-vmemmap.c               |   60 ++++++-
 mm/sparse.c                       |   44 +++--
 tools/testing/nvdimm/Kbuild       |    7 +
 tools/testing/nvdimm/test/iomap.c |   13 +
 31 files changed, 1673 insertions(+), 350 deletions(-)
 create mode 100644 drivers/nvdimm/claim.c
 create mode 100644 drivers/nvdimm/e820.c
 create mode 100644 drivers/nvdimm/pfn.h
 create mode 100644 drivers/nvdimm/pfn_devs.c


* [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-14 21:37   ` Jerome Glisse
  2015-08-15 13:33   ` Christoph Hellwig
  2015-08-13  3:50 ` [RFC PATCH 2/7] x86, mm: introduce struct vmem_altmap Dan Williams
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, Dave Hansen, david, mingo, linux-mm,
	Ingo Molnar, mgorman, H. Peter Anvin, ross.zwisler, torvalds,
	hch

While pmem is usable as a block device or via DAX mappings to userspace,
there are several usage scenarios that cannot target pmem due to its
lack of struct page coverage.  In preparation for "hot plugging" pmem
into the vmemmap, add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly, "device memory" can be removed at will when userspace
unbinds the device's driver.

Having a separate zone keeps these pages out of the page allocator and
otherwise marks them as distinct from typical uniform memory.  Device
memory has different lifetime and performance characteristics than RAM.
However, since we have run out of ZONES_SHIFT bits there is no room to
encode an additional zone in page->flags, so this functionality
currently depends on sacrificing ZONE_DMA.

arch_add_memory() is reorganized a bit in preparation for a new
arch_add_dev_memory() API; for now there is no functional change to the
memory hotplug code.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig       |   13 +++++++++++++
 arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
 include/linux/mmzone.h |   23 +++++++++++++++++++++++
 mm/memory_hotplug.c    |    5 ++++-
 mm/page_alloc.c        |    3 +++
 5 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b3a1a5d77d92..64829b17980b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -308,6 +308,19 @@ config ZONE_DMA
 
 	  If unsure, say Y.
 
+config ZONE_DEVICE
+	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
+	default !ZONE_DMA
+	depends on !ZONE_DMA
+	help
+	  Device memory hotplug support allows for establishing pmem,
+	  or other device driver discovered memory regions, in the
+	  memmap. This allows pfn_to_page() lookups of otherwise
+	  "device-physical" addresses which is needed for using a DAX
+	  mapping in an O_DIRECT operation, among other things.
+
+	  If FS_DAX is enabled, then say Y.
+
 config SMP
 	bool "Symmetric multi-processing support"
 	---help---
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..94f0fa56f0ed 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -683,15 +683,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-/*
- * Memory is added always to NORMAL zone. This means you will never get
- * additional DMA/DMA32 memory.
- */
-int arch_add_memory(int nid, u64 start, u64 size)
+static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone)
 {
-	struct pglist_data *pgdat = NODE_DATA(nid);
-	struct zone *zone = pgdat->node_zones +
-		zone_for_memory(nid, start, size, ZONE_NORMAL);
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
 	int ret;
@@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
 	ret = __add_pages(nid, zone, start_pfn, nr_pages);
 	WARN_ON_ONCE(ret);
 
-	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start, size);
+	/*
+	 * Update max_pfn, max_low_pfn and high_memory, unless we added
+	 * "device memory" which should not effect max_pfn
+	 */
+	if (!is_dev_zone(zone))
+		update_end_of_memory_vars(start, size);
 
 	return ret;
 }
+
+/*
+ * Memory is added always to NORMAL zone. This means you will never get
+ * additional DMA/DMA32 memory.
+ */
+int arch_add_memory(int nid, u64 start, u64 size)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	struct zone *zone = pgdat->node_zones +
+		zone_for_memory(nid, start, size, ZONE_NORMAL);
+
+	return __arch_add_memory(nid, start, size, zone);
+}
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
 #define PAGE_INUSE 0xFD
@@ -1028,7 +1038,7 @@ int __ref arch_remove_memory(u64 start, u64 size)
 
 	return ret;
 }
-#endif
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c25966a0a..9217fd93c25b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -319,7 +319,11 @@ enum zone_type {
 	ZONE_HIGHMEM,
 #endif
 	ZONE_MOVABLE,
+#ifdef CONFIG_ZONE_DEVICE
+	ZONE_DEVICE,
+#endif
 	__MAX_NR_ZONES
+
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -794,6 +798,25 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 	return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
 }
 
+static inline int zone_id(const struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+
+	return zone - pgdat->node_zones;
+}
+
+#ifdef CONFIG_ZONE_DEVICE
+static inline bool is_dev_zone(const struct zone *zone)
+{
+	return zone_id(zone) == ZONE_DEVICE;
+}
+#else
+static inline bool is_dev_zone(const struct zone *zone)
+{
+	return false;
+}
+#endif
+
 #include <linux/memory_hotplug.h>
 
 extern struct mutex zonelists_mutex;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 26fbba7d888f..6bc5b755ce98 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -770,7 +770,10 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 
 	start = phys_start_pfn << PAGE_SHIFT;
 	size = nr_pages * PAGE_SIZE;
-	ret = release_mem_region_adjustable(&iomem_resource, start, size);
+
+	/* in the ZONE_DEVICE case device driver owns the memory region */
+	if (!is_dev_zone(zone))
+		ret = release_mem_region_adjustable(&iomem_resource, start, size);
 	if (ret) {
 		resource_size_t endres = start + size - 1;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22b2b7d..0f19b4e18233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -207,6 +207,9 @@ static char * const zone_names[MAX_NR_ZONES] = {
 	 "HighMem",
 #endif
 	 "Movable",
+#ifdef CONFIG_ZONE_DEVICE
+	 "Device",
+#endif
 };
 
 int min_free_kbytes = 1024;



* [RFC PATCH 2/7] x86, mm: introduce struct vmem_altmap
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 3/7] x86, mm: arch_add_dev_memory() Dan Williams
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, Dave Hansen, david, mingo, linux-mm,
	Ingo Molnar, mgorman, H. Peter Anvin, ross.zwisler, torvalds,
	hch

This is a preparation patch only, with no functional changes.  It simply
makes the following patch easier to read.  struct vmem_altmap extends
the memory hotplug code so that it can map "device memory" while
allocating the storage for struct page from that same capacity.  The
first user of this capability will be the pmem driver for persistent
memory.
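
A rough sketch of the resulting layout when the memmap is placed in the
device range itself (terminology from the vmem_altmap fields added in
the next patch):

	base_pfn
	|
	v
	+---------+-----------------------+---------------------------+
	| reserve | free (memmap storage) | remaining device capacity |
	+---------+-----------------------+---------------------------+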

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/powerpc/mm/init_64.c      |    7 ++++
 arch/x86/mm/init_64.c          |   79 ++++++++++++++++++++++++++--------------
 include/linux/memory_hotplug.h |   17 ++++++++-
 include/linux/mm.h             |   13 ++++++-
 mm/memory_hotplug.c            |   67 +++++++++++++++++++++-------------
 mm/page_alloc.c                |   11 +++++-
 mm/sparse-vmemmap.c            |   29 ++++++++++++---
 mm/sparse.c                    |   29 +++++++++------
 8 files changed, 177 insertions(+), 75 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d747dd7bc90b..e3e367399935 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -404,6 +404,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end)
 		}
 	}
 }
+
+void __ref __vmemmap_free(unsigned long start, unsigned long end,
+		struct vmem_altmap *altmap)
+{
+	WARN_ONCE(altmap, "vmem_altmap support not implemented.\n");
+	return vmemmap_free(start, end);
+}
 #endif
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 94f0fa56f0ed..c2f872a379d2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -683,7 +683,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
 	}
 }
 
-static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone)
+static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone,
+		struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -691,7 +692,7 @@ static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone)
 
 	init_memory_mapping(start, start + size);
 
-	ret = __add_pages(nid, zone, start_pfn, nr_pages);
+	ret = __add_pages_altmap(nid, zone, start_pfn, nr_pages, altmap);
 	WARN_ON_ONCE(ret);
 
 	/*
@@ -714,7 +715,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
 	struct zone *zone = pgdat->node_zones +
 		zone_for_memory(nid, start, size, ZONE_NORMAL);
 
-	return __arch_add_memory(nid, start, size, zone);
+	return __arch_add_memory(nid, start, size, zone, NULL);
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
@@ -758,7 +759,8 @@ static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
 	spin_unlock(&init_mm.page_table_lock);
 }
 
-static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud,
+		struct vmem_altmap *altmap)
 {
 	pmd_t *pmd;
 	int i;
@@ -869,9 +871,9 @@ remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
 		update_page_count(PG_LEVEL_4K, -pages);
 }
 
-static void __meminit
+static void noinline __meminit
 remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
-		 bool direct)
+		 bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pte_t *pte_base;
@@ -925,9 +927,9 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 		update_page_count(PG_LEVEL_2M, -pages);
 }
 
-static void __meminit
+static void noinline __meminit
 remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
-		 bool direct)
+		 bool direct, struct vmem_altmap *altmap)
 {
 	unsigned long next, pages = 0;
 	pmd_t *pmd_base;
@@ -972,8 +974,8 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 		}
 
 		pmd_base = (pmd_t *)pud_page_vaddr(*pud);
-		remove_pmd_table(pmd_base, addr, next, direct);
-		free_pmd_table(pmd_base, pud);
+		remove_pmd_table(pmd_base, addr, next, direct, altmap);
+		free_pmd_table(pmd_base, pud, altmap);
 	}
 
 	if (direct)
@@ -982,7 +984,8 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 
 /* start and end are both virtual address. */
 static void __meminit
-remove_pagetable(unsigned long start, unsigned long end, bool direct)
+remove_pagetable(unsigned long start, unsigned long end, bool direct,
+		struct vmem_altmap *altmap)
 {
 	unsigned long next;
 	unsigned long addr;
@@ -998,7 +1001,7 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 			continue;
 
 		pud = (pud_t *)pgd_page_vaddr(*pgd);
-		remove_pud_table(pud, addr, next, direct);
+		remove_pud_table(pud, addr, next, direct, altmap);
 		if (free_pud_table(pud, pgd))
 			pgd_changed = true;
 	}
@@ -1009,9 +1012,15 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct)
 	flush_tlb_all();
 }
 
+void __ref __vmemmap_free(unsigned long start, unsigned long end,
+		struct vmem_altmap *altmap)
+{
+	remove_pagetable(start, end, false, altmap);
+}
+
 void __ref vmemmap_free(unsigned long start, unsigned long end)
 {
-	remove_pagetable(start, end, false);
+	return __vmemmap_free(start, end, NULL);
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
@@ -1021,22 +1030,25 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
 	start = (unsigned long)__va(start);
 	end = (unsigned long)__va(end);
 
-	remove_pagetable(start, end, true);
+	remove_pagetable(start, end, true, NULL);
 }
 
-int __ref arch_remove_memory(u64 start, u64 size)
+static int __ref __arch_remove_memory(u64 start, u64 size, struct zone *zone,
+		struct vmem_altmap *altmap)
 {
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
-	int ret;
-
-	zone = page_zone(pfn_to_page(start_pfn));
 	kernel_physical_mapping_remove(start, start + size);
-	ret = __remove_pages(zone, start_pfn, nr_pages);
-	WARN_ON_ONCE(ret);
+	return __remove_pages_altmap(zone, __phys_to_pfn(start),
+			__phys_to_pfn(size), altmap);
+}
 
-	return ret;
+int __ref arch_remove_memory(u64 start, u64 size)
+{
+	struct zone *zone = page_zone(pfn_to_page(__phys_to_pfn(start)));
+	int rc;
+
+	rc = __arch_remove_memory(start, size, zone, NULL);
+	WARN_ON_ONCE(rc);
+	return rc;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -1244,7 +1256,7 @@ static void __meminitdata *p_start, *p_end;
 static int __meminitdata node_start;
 
 static int __meminit vmemmap_populate_hugepages(unsigned long start,
-						unsigned long end, int node)
+		unsigned long end, int node, struct vmem_altmap *altmap)
 {
 	unsigned long addr;
 	unsigned long next;
@@ -1267,7 +1279,7 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 		if (pmd_none(*pmd)) {
 			void *p;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap);
 			if (p) {
 				pte_t entry;
 
@@ -1300,12 +1312,18 @@ static int __meminit vmemmap_populate_hugepages(unsigned long start,
 	return 0;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end,
+		int node, struct vmem_altmap *altmap)
 {
 	int err;
 
+	if (!cpu_has_pse && altmap) {
+		pr_warn_once("vmemmap: alternate mapping not supported\n");
+		return -ENXIO;
+	}
+
 	if (cpu_has_pse)
-		err = vmemmap_populate_hugepages(start, end, node);
+		err = vmemmap_populate_hugepages(start, end, node, altmap);
 	else
 		err = vmemmap_populate_basepages(start, end, node);
 	if (!err)
@@ -1313,6 +1331,11 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 	return err;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
+{
+	return __vmemmap_populate(start, end, node, NULL);
+}
+
 #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE)
 void register_page_bootmem_memmap(unsigned long section_nr,
 				  struct page *start_page, unsigned long size)
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 6ffa0ac7f7d6..48a4e0a5e13d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -11,6 +11,7 @@ struct zone;
 struct pglist_data;
 struct mem_section;
 struct memory_block;
+struct vmem_altmap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 
@@ -101,6 +102,8 @@ extern int try_online_node(int nid);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
 extern int arch_remove_memory(u64 start, u64 size);
+extern int __remove_pages_altmap(struct zone *zone, unsigned long start_pfn,
+	unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -109,6 +112,14 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
 extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages);
 
+/*
+ * Specialized interface for callers that want to control the allocation
+ * of the memmap
+ */
+extern int __add_pages_altmap(int nid, struct zone *zone,
+		unsigned long start_pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap);
+
 #ifdef CONFIG_NUMA
 extern int memory_add_physaddr_to_nid(u64 start);
 #else
@@ -271,8 +282,10 @@ extern int arch_add_memory(int nid, u64 start, u64 size);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
-extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms);
+extern int sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
+		struct vmem_altmap *altmap);
+extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 348f69467f54..de44de70e63a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1827,6 +1827,9 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
+extern void __memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
+		enum memmap_context context, struct vmem_altmap *);
+
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
@@ -2212,20 +2215,28 @@ void sparse_mem_maps_populate_node(struct page **map_map,
 				   unsigned long map_count,
 				   int nodeid);
 
+struct vmem_altmap;
 struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
+struct page *sparse_alt_map_populate(unsigned long pnum, int nid,
+		struct vmem_altmap *altmap);
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(pgd_t *pgd, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node);
 void *vmemmap_alloc_block(unsigned long size, int node);
-void *vmemmap_alloc_block_buf(unsigned long size, int node);
+void *vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap);
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
+int __vmemmap_populate(unsigned long start, unsigned long end, int node,
+		struct vmem_altmap *altmap);
 int vmemmap_populate(unsigned long start, unsigned long end, int node);
 void vmemmap_populate_print_last(void);
 #ifdef CONFIG_MEMORY_HOTPLUG
 void vmemmap_free(unsigned long start, unsigned long end);
+void __vmemmap_free(unsigned long start, unsigned long end,
+		struct vmem_altmap *altmap);
 #endif
 void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6bc5b755ce98..d4bcfeaaec37 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -440,7 +440,8 @@ static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long s
 					pgdat->node_start_pfn;
 }
 
-static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
+static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn,
+		struct vmem_altmap *altmap)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nr_pages = PAGES_PER_SECTION;
@@ -459,25 +460,26 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 	grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
 			phys_start_pfn + nr_pages);
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
-	memmap_init_zone(nr_pages, nid, zone_type,
-			 phys_start_pfn, MEMMAP_HOTPLUG);
+	__memmap_init_zone(nr_pages, nid, zone_type,
+			 phys_start_pfn, MEMMAP_HOTPLUG, altmap);
 	return 0;
 }
 
 static int __meminit __add_section(int nid, struct zone *zone,
-					unsigned long phys_start_pfn)
+		unsigned long phys_start_pfn,
+		struct vmem_altmap *altmap)
 {
 	int ret;
 
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
 
-	ret = sparse_add_one_section(zone, phys_start_pfn);
+	ret = sparse_add_one_section(zone, phys_start_pfn, altmap);
 
 	if (ret < 0)
 		return ret;
 
-	ret = __add_zone(zone, phys_start_pfn);
+	ret = __add_zone(zone, phys_start_pfn, altmap);
 
 	if (ret < 0)
 		return ret;
@@ -491,18 +493,20 @@ static int __meminit __add_section(int nid, struct zone *zone,
  * call this function after deciding the zone to which to
  * add the new pages.
  */
-int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
-			unsigned long nr_pages)
+int __ref __add_pages_altmap(int nid, struct zone *zone,
+		unsigned long phys_start_pfn, unsigned long nr_pages,
+		struct vmem_altmap *altmap)
 {
 	unsigned long i;
 	int err = 0;
 	int start_sec, end_sec;
+
 	/* during initialize mem_map, align hot-added range to section */
 	start_sec = pfn_to_section_nr(phys_start_pfn);
 	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
 
 	for (i = start_sec; i <= end_sec; i++) {
-		err = __add_section(nid, zone, section_nr_to_pfn(i));
+		err = __add_section(nid, zone, section_nr_to_pfn(i), altmap);
 
 		/*
 		 * EEXIST is finally dealt with by ioresource collision
@@ -517,6 +521,12 @@ int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
 
 	return err;
 }
+
+int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
+			unsigned long nr_pages)
+{
+	return __add_pages_altmap(nid, zone, phys_start_pfn, nr_pages, NULL);
+}
 EXPORT_SYMBOL_GPL(__add_pages);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
@@ -722,7 +732,8 @@ static void __remove_zone(struct zone *zone, unsigned long start_pfn)
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 }
 
-static int __remove_section(struct zone *zone, struct mem_section *ms)
+static int __remove_section(struct zone *zone, struct mem_section *ms,
+		struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn;
 	int scn_nr;
@@ -739,23 +750,12 @@ static int __remove_section(struct zone *zone, struct mem_section *ms)
 	start_pfn = section_nr_to_pfn(scn_nr);
 	__remove_zone(zone, start_pfn);
 
-	sparse_remove_one_section(zone, ms);
+	sparse_remove_one_section(zone, ms, altmap);
 	return 0;
 }
 
-/**
- * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
- * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
- * @nr_pages: number of pages to remove (must be multiple of section size)
- *
- * Generic helper function to remove section mappings and sysfs entries
- * for the section of the memory we are removing. Caller needs to make
- * sure that pages are marked reserved and zones are adjust properly by
- * calling offline_pages().
- */
-int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
-		 unsigned long nr_pages)
+int __remove_pages_altmap(struct zone *zone, unsigned long phys_start_pfn,
+		 unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	unsigned long i;
 	int sections_to_remove;
@@ -784,12 +784,29 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	sections_to_remove = nr_pages / PAGES_PER_SECTION;
 	for (i = 0; i < sections_to_remove; i++) {
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
-		ret = __remove_section(zone, __pfn_to_section(pfn));
+		ret = __remove_section(zone, __pfn_to_section(pfn), altmap);
 		if (ret)
 			break;
 	}
 	return ret;
 }
+
+/**
+ * __remove_pages() - remove sections of pages from a zone
+ * @zone: zone from which pages need to be removed
+ * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
+ * @nr_pages: number of pages to remove (must be multiple of section size)
+ *
+ * Generic helper function to remove section mappings and sysfs entries
+ * for the section of the memory we are removing. Caller needs to make
+ * sure that pages are marked reserved and zones are adjust properly by
+ * calling offline_pages().
+ */
+int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+		 unsigned long nr_pages)
+{
+	return __remove_pages_altmap(zone, phys_start_pfn, nr_pages, NULL);
+}
 EXPORT_SYMBOL_GPL(__remove_pages);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f19b4e18233..c18520831dbc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4577,8 +4577,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
  * up by free_all_bootmem() once the early boot process is
  * done. Non-atomic initialization, single-pass.
  */
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
-		unsigned long start_pfn, enum memmap_context context)
+void __meminit __memmap_init_zone(unsigned long size, int nid,
+		unsigned long zone, unsigned long start_pfn,
+		enum memmap_context context, struct vmem_altmap *altmap)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
 	unsigned long end_pfn = start_pfn + size;
@@ -4631,6 +4632,12 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	}
 }
 
+void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
+		unsigned long start_pfn, enum memmap_context context)
+{
+	return __memmap_init_zone(size, nid, zone, start_pfn, context, NULL);
+}
+
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	unsigned int order, t;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 4cba9c2783a1..16ec1675b793 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -69,8 +69,7 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 				__pa(MAX_DMA_ADDRESS));
 }
 
-/* need to make sure size is all the same during early stage */
-void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
+static void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node)
 {
 	void *ptr;
 
@@ -87,6 +86,13 @@ void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
 	return ptr;
 }
 
+/* need to make sure size is all the same during early stage */
+void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
+		struct vmem_altmap *altmap)
+{
+	return __vmemmap_alloc_block_buf(size, node);
+}
+
 void __meminit vmemmap_verify(pte_t *pte, int node,
 				unsigned long start, unsigned long end)
 {
@@ -103,7 +109,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node)
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p = vmemmap_alloc_block_buf(PAGE_SIZE, node);
+		void *p = __vmemmap_alloc_block_buf(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
@@ -176,7 +182,15 @@ int __meminit vmemmap_populate_basepages(unsigned long start,
 	return 0;
 }
 
-struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
+__weak int __vmemmap_populate(unsigned long start, unsigned long end, int node,
+		struct vmem_altmap *altmap)
+{
+	pr_warn_once("%s: arch does not support vmem_altmap\n", __func__);
+	return -ENOMEM;
+}
+
+struct page * __meminit sparse_alt_map_populate(unsigned long pnum, int nid,
+		struct vmem_altmap *altmap)
 {
 	unsigned long start;
 	unsigned long end;
@@ -186,12 +200,17 @@ struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
 	start = (unsigned long)map;
 	end = (unsigned long)(map + PAGES_PER_SECTION);
 
-	if (vmemmap_populate(start, end, nid))
+	if (__vmemmap_populate(start, end, nid, altmap))
 		return NULL;
 
 	return map;
 }
 
+struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
+{
+	return sparse_alt_map_populate(pnum, nid, NULL);
+}
+
 void __init sparse_mem_maps_populate_node(struct page **map_map,
 					  unsigned long pnum_begin,
 					  unsigned long pnum_end,
diff --git a/mm/sparse.c b/mm/sparse.c
index d1b48b691ac8..eda783903b1d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -595,17 +595,19 @@ void __init sparse_init(void)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid)
+static struct page *alloc_section_memmap(unsigned long pnum, int nid,
+		struct vmem_altmap *altmap)
 {
-	/* This will make the necessary allocations eventually. */
 	return sparse_mem_map_populate(pnum, nid);
 }
-static void __kfree_section_memmap(struct page *memmap)
+
+static inline void free_section_memmap(struct page *memmap,
+		struct vmem_altmap *altmap)
 {
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
-	vmemmap_free(start, end);
+	__vmemmap_free(start, end, NULL);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)
@@ -690,7 +692,8 @@ static void free_map_bootmem(struct page *memmap)
  * set.  If this is <=0, then that means that the passed-in
  * map was not consumed and must be freed.
  */
-int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
+int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
+		struct vmem_altmap *altmap)
 {
 	unsigned long section_nr = pfn_to_section_nr(start_pfn);
 	struct pglist_data *pgdat = zone->zone_pgdat;
@@ -707,12 +710,12 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn)
 	ret = sparse_index_init(section_nr, pgdat->node_id);
 	if (ret < 0 && ret != -EEXIST)
 		return ret;
-	memmap = kmalloc_section_memmap(section_nr, pgdat->node_id);
+	memmap = alloc_section_memmap(section_nr, pgdat->node_id, altmap);
 	if (!memmap)
 		return -ENOMEM;
 	usemap = __kmalloc_section_usemap();
 	if (!usemap) {
-		__kfree_section_memmap(memmap);
+		free_section_memmap(memmap, altmap);
 		return -ENOMEM;
 	}
 
@@ -734,7 +737,7 @@ out:
 	pgdat_resize_unlock(pgdat, &flags);
 	if (ret <= 0) {
 		kfree(usemap);
-		__kfree_section_memmap(memmap);
+		free_section_memmap(memmap, altmap);
 	}
 	return ret;
 }
@@ -761,7 +764,8 @@ static inline void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
 }
 #endif
 
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+		struct vmem_altmap *altmap)
 {
 	struct page *usemap_page;
 
@@ -775,7 +779,7 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 	if (PageSlab(usemap_page) || PageCompound(usemap_page)) {
 		kfree(usemap);
 		if (memmap)
-			__kfree_section_memmap(memmap);
+			free_section_memmap(memmap, altmap);
 		return;
 	}
 
@@ -788,7 +792,8 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
+void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+		struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
@@ -805,7 +810,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
 	pgdat_resize_unlock(pgdat, &flags);
 
 	clear_hwpoisoned_pages(memmap, PAGES_PER_SECTION);
-	free_section_usemap(memmap, usemap);
+	free_section_usemap(memmap, usemap, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 #endif /* CONFIG_MEMORY_HOTPLUG */



* [RFC PATCH 3/7] x86, mm: arch_add_dev_memory()
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 2/7] x86, mm: introduce struct vmem_altmap Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 4/7] mm: register_dev_memmap() Dan Williams
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, Dave Hansen, david, mingo, linux-mm,
	Ingo Molnar, mgorman, H. Peter Anvin, ross.zwisler, torvalds,
	hch

Use struct vmem_altmap to augment vmemmap_{populate|free}().

In support of providing struct page coverage for persistent memory, use
struct vmem_altmap to change the default policy for mapping pfns for a
page range.  The default vmemmap_populate() allocates page table storage
from the page allocator.  In support of storing struct page
infrastructure on device memory (pmem) directly, vmem_altmap directs
vmemmap_populate() to use a pre-allocated block of contiguous pfns for
storage of the new vmemmap entries.
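
As an illustration only (not code from this patch, and the pfn counts
'nr_driver_pfns' / 'nr_memmap_pfns' are made-up names), a driver carving
its memmap out of the device range would populate a vmem_altmap and hand
it to arch_add_dev_memory():

	/* hypothetical driver-side setup, sizes illustrative */
	struct vmem_altmap altmap = {
		.base_pfn = __phys_to_pfn(res->start),
		.reserve = nr_driver_pfns,	/* skipped / driver-owned */
		.free = nr_memmap_pfns,		/* holds the memmap itself */
	};
	int rc;

	rc = arch_add_dev_memory(dev_to_node(dev), res->start,
			resource_size(res), &altmap);

__add_pages_altmap() then validates that base_pfn matches the start of
the hot-added range and that reserve + free fit within it.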

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/mm/init_64.c          |   55 +++++++++++++++++++++++++++++++++++++---
 include/linux/memory_hotplug.h |    4 +++
 include/linux/mm.h             |   38 +++++++++++++++++++++++++++-
 mm/memory_hotplug.c            |   12 +++++++++
 mm/page_alloc.c                |    4 +++
 mm/sparse-vmemmap.c            |   31 +++++++++++++++++++++++
 mm/sparse.c                    |   17 +++++++++++-
 7 files changed, 154 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index c2f872a379d2..eda65ec8484e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -719,6 +719,21 @@ int arch_add_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(arch_add_memory);
 
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * The primary difference vs arch_add_memory is that the zone is known
+ * a priori.
+ */
+int arch_add_dev_memory(int nid, u64 start, u64 size,
+		struct vmem_altmap *altmap)
+{
+	struct pglist_data *pgdat = NODE_DATA(nid);
+	struct zone *zone = pgdat->node_zones + ZONE_DEVICE;
+
+	return __arch_add_memory(nid, start, size, zone, altmap);
+}
+#endif
+
 #define PAGE_INUSE 0xFD
 
 static void __meminit free_pagetable(struct page *page, int order)
@@ -771,8 +786,13 @@ static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud,
 			return;
 	}
 
-	/* free a pmd talbe */
-	free_pagetable(pud_page(*pud), 0);
+	/*
+	 * Free a pmd table if it came from the page allocator (i.e. !altmap).
+	 * In the altmap case the pages are being freed implicitly by the
+	 * section becoming unmapped / unplugged.
+	 */
+	if (!altmap)
+		free_pagetable(pud_page(*pud), 0);
 	spin_lock(&init_mm.page_table_lock);
 	pud_clear(pud);
 	spin_unlock(&init_mm.page_table_lock);
@@ -890,7 +910,7 @@ remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
 		if (pmd_large(*pmd)) {
 			if (IS_ALIGNED(addr, PMD_SIZE) &&
 			    IS_ALIGNED(next, PMD_SIZE)) {
-				if (!direct)
+				if (!direct && !altmap)
 					free_pagetable(pmd_page(*pmd),
 						       get_order(PMD_SIZE));
 
@@ -946,7 +966,7 @@ remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
 		if (pud_large(*pud)) {
 			if (IS_ALIGNED(addr, PUD_SIZE) &&
 			    IS_ALIGNED(next, PUD_SIZE)) {
-				if (!direct)
+				if (!direct && !altmap)
 					free_pagetable(pud_page(*pud),
 						       get_order(PUD_SIZE));
 
@@ -993,6 +1013,8 @@ remove_pagetable(unsigned long start, unsigned long end, bool direct,
 	pud_t *pud;
 	bool pgd_changed = false;
 
+	WARN_ON_ONCE(direct && altmap);
+
 	for (addr = start; addr < end; addr = next) {
 		next = pgd_addr_end(addr, end);
 
@@ -1041,6 +1063,31 @@ static int __ref __arch_remove_memory(u64 start, u64 size, struct zone *zone,
 			__phys_to_pfn(size), altmap);
 }
 
+int __ref arch_remove_dev_memory(u64 start, u64 size,
+		struct vmem_altmap *altmap)
+{
+	unsigned long pfn = __phys_to_pfn(start);
+	struct zone *zone;
+	int rc;
+
+	/*
+	 * Reserve pages will not have initialized pfns, so we need to
+	 * calculate the page zone from the first valid pfn.
+	 */
+	if (altmap) {
+		if (altmap->base_pfn != pfn) {
+			WARN_ONCE(1, "pfn: %#lx expected: %#lx\n",
+					pfn, altmap->base_pfn);
+			return -EINVAL;
+		}
+		pfn += altmap->reserve;
+	}
+	zone = page_zone(pfn_to_page(pfn));
+	rc = __arch_remove_memory(start, size, zone, altmap);
+	WARN_ON_ONCE(rc);
+	return rc;
+}
+
 int __ref arch_remove_memory(u64 start, u64 size)
 {
 	struct zone *zone = page_zone(pfn_to_page(__phys_to_pfn(start)));
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 48a4e0a5e13d..6a9f05e2c02f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -102,6 +102,8 @@ extern int try_online_node(int nid);
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
 extern int arch_remove_memory(u64 start, u64 size);
+extern int arch_remove_dev_memory(u64 start, u64 size,
+		struct vmem_altmap *altmap);
 extern int __remove_pages_altmap(struct zone *zone, unsigned long start_pfn,
 	unsigned long nr_pages, struct vmem_altmap *altmap);
 extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
@@ -279,6 +281,8 @@ extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 extern int add_memory(int nid, u64 start, u64 size);
 extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default);
 extern int arch_add_memory(int nid, u64 start, u64 size);
+extern int arch_add_dev_memory(int nid, u64 start, u64 size,
+		struct vmem_altmap *altmap);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
 extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index de44de70e63a..8a4f24d7fdb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2215,7 +2215,43 @@ void sparse_mem_maps_populate_node(struct page **map_map,
 				   unsigned long map_count,
 				   int nodeid);
 
-struct vmem_altmap;
+/**
+ * struct vmem_altmap - augment vmemmap_populate with pre-allocated pte storage
+ * @base_pfn: first pfn of the allocation
+ * @reserve: number of pfns reserved by the device relative to base
+ * @free: range of memmap storage / offset to data from section0
+ * @alloc: tracks num pfns consumed for page map, private to vmemmap_populate()
+ */
+struct vmem_altmap {
+	const unsigned long base_pfn;
+	const unsigned long reserve;
+	unsigned long free;
+	unsigned long alloc;
+};
+
+static inline unsigned long vmem_altmap_nr_free(struct vmem_altmap *altmap)
+{
+	if (altmap->free > altmap->alloc)
+		return altmap->free - altmap->alloc;
+	return 0;
+}
+
+static inline unsigned long vmem_altmap_next_pfn(struct vmem_altmap *altmap)
+{
+	return altmap->base_pfn + altmap->alloc;
+}
+
+static inline unsigned long vmem_altmap_alloc(struct vmem_altmap *altmap,
+		unsigned long nr_pfns)
+{
+	unsigned long pfn = vmem_altmap_next_pfn(altmap);
+
+	if (nr_pfns > vmem_altmap_nr_free(altmap))
+		return ULONG_MAX;
+	altmap->alloc += nr_pfns;
+	return pfn;
+}
+
 struct page *sparse_mem_map_populate(unsigned long pnum, int nid);
 struct page *sparse_alt_map_populate(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4bcfeaaec37..79cb7595b659 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -505,6 +505,18 @@ int __ref __add_pages_altmap(int nid, struct zone *zone,
 	start_sec = pfn_to_section_nr(phys_start_pfn);
 	end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
 
+	if (altmap) {
+		/*
+		 * Validate altmap is within bounds of the total request
+		 */
+		if (altmap->base_pfn != phys_start_pfn || (altmap->reserve
+					+ altmap->free) > nr_pages) {
+			pr_warn_once("memory add fail, invalid altmap\n");
+			return -EINVAL;
+		}
+		altmap->alloc = 0;
+	}
+
 	for (i = start_sec; i <= end_sec; i++) {
 		err = __add_section(nid, zone, section_nr_to_pfn(i), altmap);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c18520831dbc..498193b8811d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4590,6 +4590,10 @@ void __meminit __memmap_init_zone(unsigned long size, int nid,
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
 
+	/* skip initializing a number of pfns from the start of the section */
+	if (altmap && start_pfn == altmap->base_pfn)
+		start_pfn += altmap->reserve;
+
 	z = &pgdat->node_zones[zone];
 	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		/*
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 16ec1675b793..6ea8027daf00 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -86,10 +86,41 @@ static void * __meminit __vmemmap_alloc_block_buf(unsigned long size, int node)
 	return ptr;
 }
 
+static void * __meminit altmap_alloc_block_buf(unsigned long size,
+		struct vmem_altmap *altmap)
+{
+	unsigned long pfn, start_pfn = vmem_altmap_next_pfn(altmap);
+	unsigned long align = 0;
+	void *ptr;
+
+	if (!is_power_of_2(size) || size < PAGE_SIZE) {
+		pr_warn_once("%s: allocations must be multiple of PAGE_SIZE (%ld)\n",
+				__func__, PAGE_SIZE);
+		return NULL;
+	}
+
+	size >>= PAGE_SHIFT;
+	if (start_pfn & (size - 1))
+		align = ALIGN(start_pfn, size) - start_pfn;
+
+	pfn = vmem_altmap_alloc(altmap, align + size);
+	if (pfn < ULONG_MAX)
+		ptr = __va(__pfn_to_phys(pfn));
+	else
+		ptr = NULL;
+	pr_debug("%s: start: %#lx align: %#lx next: %#lx nr: %#lx %p\n",
+			__func__, start_pfn, align,
+			vmem_altmap_next_pfn(altmap), size + align, ptr);
+
+	return ptr;
+}
+
 /* need to make sure size is all the same during early stage */
 void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node,
 		struct vmem_altmap *altmap)
 {
+	if (altmap)
+		return altmap_alloc_block_buf(size, altmap);
 	return __vmemmap_alloc_block_buf(size, node);
 }
 
diff --git a/mm/sparse.c b/mm/sparse.c
index eda783903b1d..529b16509eca 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -369,6 +369,13 @@ static void __init sparse_early_usemaps_alloc_node(void *data,
 }
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
+struct page __init *sparse_alt_map_populate(unsigned long pnum, int nid,
+		struct vmem_altmap *altmap)
+{
+	pr_warn_once("%s: requires CONFIG_SPARSEMEM_VMEMMAP=y\n", __func__);
+	return NULL;
+}
+
 struct page __init *sparse_mem_map_populate(unsigned long pnum, int nid)
 {
 	struct page *map;
@@ -598,7 +605,10 @@ void __init sparse_init(void)
 static struct page *alloc_section_memmap(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap)
 {
-	return sparse_mem_map_populate(pnum, nid);
+	if (altmap)
+		return sparse_alt_map_populate(pnum, nid, altmap);
+	else
+		return sparse_mem_map_populate(pnum, nid);
 }
 
 static inline void free_section_memmap(struct page *memmap,
@@ -607,7 +617,10 @@ static inline void free_section_memmap(struct page *memmap,
 	unsigned long start = (unsigned long)memmap;
 	unsigned long end = (unsigned long)(memmap + PAGES_PER_SECTION);
 
-	__vmemmap_free(start, end, NULL);
+	if (altmap)
+		__vmemmap_free(start, end, altmap);
+	else
+		__vmemmap_free(start, end, NULL);
 }
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static void free_map_bootmem(struct page *memmap)



* [RFC PATCH 4/7] mm: register_dev_memmap()
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
                   ` (2 preceding siblings ...)
  2015-08-13  3:50 ` [RFC PATCH 3/7] x86, mm: arch_add_dev_memory() Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-15  9:04   ` Christoph Hellwig
  2015-08-13  3:50 ` [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option Dan Williams
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, Dave Hansen, david, mingo, linux-mm,
	Ingo Molnar, mgorman, H. Peter Anvin, ross.zwisler, torvalds,
	hch

Provide an interface for device drivers to register physical memory.

The register_dev_memmap() API enables a device driver like pmem to set
up struct page entries for the memory it has discovered.  While this
mechanism is motivated by the desire to use persistent memory outside of
the block I/O and direct access (DAX) paths, it is generic to any
physical range that is not marked as RAM at boot.

Given that capacities for the registered memory range may be too large
to house the memmap in RAM, this interface allows for the memmap to be
allocated from the new range being registered.  The pmem driver uses
this capability to let userspace policy determine the placement of the
memmap for persistent memory.
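
A sketch of the intended call pattern from a driver's probe and remove
paths (illustrative only, error handling elided):

	struct dev_map *dev_map;

	dev_map = register_dev_memmap(dev, res, &altmap);
	if (!dev_map)
		return -ENXIO;	/* hotplug failed, or ZONE_DEVICE disabled */

	/* ... pages for 'res' are now valid via pfn_to_page() ... */

	/* on driver teardown */
	unregister_dev_memmap(dev_map);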

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/kmap_pfn.h |   33 ++++++++
 include/linux/mm.h       |    4 +
 mm/kmap_pfn.c            |  195 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+), 1 deletion(-)

diff --git a/include/linux/kmap_pfn.h b/include/linux/kmap_pfn.h
index fa44971d8e95..2dfad83337ba 100644
--- a/include/linux/kmap_pfn.h
+++ b/include/linux/kmap_pfn.h
@@ -4,7 +4,9 @@
 #include <linux/highmem.h>
 
 struct device;
+struct dev_map;
 struct resource;
+struct vmem_altmap;
 #ifdef CONFIG_KMAP_PFN
 extern void *kmap_atomic_pfn_t(__pfn_t pfn);
 extern void kunmap_atomic_pfn_t(void *addr);
@@ -28,4 +30,35 @@ static inline int devm_register_kmap_pfn_range(struct device *dev,
 }
 #endif /* CONFIG_KMAP_PFN */
 
+#ifdef CONFIG_ZONE_DEVICE
+struct dev_map *__register_dev_memmap(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap, struct module *mod);
+void unregister_dev_memmap(struct dev_map *dev_map);
+struct dev_map * __must_check try_pin_devpfn_range(__pfn_t pfn);
+void unpin_devpfn_range(struct dev_map *dev_map);
+#else
+static inline struct dev_map *__register_dev_memmap(struct device *dev,
+		struct resource *res, struct vmem_altmap *altmap,
+		struct module *mod)
+{
+	return NULL;
+}
+
+static inline void unregister_dev_memmap(struct dev_map *dev_map)
+{
+}
+
+static inline struct dev_map * __must_check try_pin_devpfn_range(__pfn_t pfn)
+{
+	return NULL;
+}
+
+static inline void unpin_devpfn_range(struct dev_map *dev_map)
+{
+}
+#endif
+
+#define register_dev_memmap(d, r, a) \
+__register_dev_memmap((d), (r), (a), THIS_MODULE)
+
 #endif /* _LINUX_KMAP_PFN_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8a4f24d7fdb0..07152a54b841 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -939,6 +939,7 @@ typedef struct {
  * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
  * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
  * PFN_DEV - pfn is not covered by system memmap
+ * PFN_MAP - pfn is covered by a device specific memmap
  */
 enum {
 	PFN_MASK = (1UL << PAGE_SHIFT) - 1,
@@ -949,6 +950,7 @@ enum {
 #else
 	PFN_DEV = 0,
 #endif
+	PFN_MAP = (1UL << 3),
 };
 
 static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
@@ -965,7 +967,7 @@ static inline __pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
 
 static inline bool __pfn_t_has_page(__pfn_t pfn)
 {
-	return (pfn.val & PFN_DEV) == 0;
+	return (pfn.val & PFN_DEV) == 0 || (pfn.val & PFN_MAP) == PFN_MAP;
 }
 
 static inline unsigned long __pfn_t_to_pfn(__pfn_t pfn)
diff --git a/mm/kmap_pfn.c b/mm/kmap_pfn.c
index 2d58e167dfbc..d60ac7463454 100644
--- a/mm/kmap_pfn.c
+++ b/mm/kmap_pfn.c
@@ -10,16 +10,36 @@
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  * General Public License for more details.
  */
+#include <linux/percpu-refcount.h>
+#include <linux/kmap_pfn.h>
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/highmem.h>
 #include <linux/device.h>
+#include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
 
 static LIST_HEAD(ranges);
+static LIST_HEAD(dev_maps);
 static DEFINE_MUTEX(register_lock);
+static DECLARE_WAIT_QUEUE_HEAD(dev_map_wq);
+
+#ifndef CONFIG_MEMORY_HOTPLUG
+int __weak arch_add_dev_memory(int nid, u64 start, u64 size,
+		struct vmem_altmap *altmap)
+{
+	return -ENXIO;
+}
+#endif
+
+#ifndef CONFIG_MEMORY_HOTREMOVE
+int __weak arch_remove_dev_memory(u64 start, u64 size)
+{
+	return -ENXIO;
+}
+#endif
 
 struct kmap {
 	struct list_head list;
@@ -28,6 +48,22 @@ struct kmap {
 	void *base;
 };
 
+enum {
+	DEV_MAP_LIVE,
+	DEV_MAP_CONFIRM,
+};
+
+struct dev_map {
+	struct list_head list;
+	resource_size_t base;
+	resource_size_t end;
+	struct percpu_ref percpu_ref;
+	struct device *dev;
+	struct module *module;
+	struct vmem_altmap *altmap;
+	unsigned long flags;
+};
+
 static void teardown_kmap(void *data)
 {
 	struct kmap *kmap = data;
@@ -115,3 +151,162 @@ void kunmap_atomic_pfn_t(void *addr)
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(kunmap_atomic_pfn_t);
+
+#ifdef CONFIG_ZONE_DEVICE
+static struct dev_map *to_dev_map(struct percpu_ref *ref)
+{
+	return container_of(ref, struct dev_map, percpu_ref);
+}
+
+static void dev_map_release(struct percpu_ref *ref)
+{
+	struct dev_map *dev_map = to_dev_map(ref);
+
+	/* signal dev_map is idle (no more refs) */
+	clear_bit(DEV_MAP_LIVE, &dev_map->flags);
+	wake_up_all(&dev_map_wq);
+}
+
+static void dev_map_confirm(struct percpu_ref *ref)
+{
+	struct dev_map *dev_map = to_dev_map(ref);
+
+	/* signal dev_map is confirmed dead (slow path ref mode) */
+	set_bit(DEV_MAP_CONFIRM, &dev_map->flags);
+	wake_up_all(&dev_map_wq);
+}
+
+static void kill_dev_map(struct dev_map *dev_map)
+{
+	percpu_ref_kill_and_confirm(&dev_map->percpu_ref, dev_map_confirm);
+	wait_event(dev_map_wq, test_bit(DEV_MAP_CONFIRM, &dev_map->flags));
+}
+
+struct dev_map *__register_dev_memmap(struct device *dev, struct resource *res,
+		struct vmem_altmap *altmap, struct module *mod)
+{
+	struct dev_map *dev_map;
+	int rc, nid;
+
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)
+			&& IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
+		/* pass */;
+	else
+		return NULL;
+
+	dev_map = kzalloc(sizeof(*dev_map), GFP_KERNEL);
+	if (!dev_map)
+		return NULL;
+
+	if (altmap) {
+		dev_map->altmap = kmemdup(altmap, sizeof(*altmap), GFP_KERNEL);
+		if (!dev_map->altmap)
+			goto err_altmap;
+	}
+
+	if (!try_module_get(mod))
+		goto err_mod;
+
+	nid = dev_to_node(dev);
+	if (nid < 0)
+		nid = 0;
+	INIT_LIST_HEAD(&dev_map->list);
+	dev_map->dev = dev;
+	dev_map->base = res->start;
+	dev_map->end = res->end;
+	dev_map->module = mod;
+	set_bit(DEV_MAP_LIVE, &dev_map->flags);
+	if (percpu_ref_init(&dev_map->percpu_ref, dev_map_release, 0,
+				GFP_KERNEL))
+		goto err_ref;
+
+	mutex_lock(&register_lock);
+	list_add_rcu(&dev_map->list, &dev_maps);
+	mutex_unlock(&register_lock);
+
+	rc = arch_add_dev_memory(nid, res->start, resource_size(res), altmap);
+	if (rc) {
+		/*
+		 * It is safe to delete here without checking percpu_ref
+		 * since this dev_map is established before
+		 * ->direct_access() has advertised this pfn range to
+		 *  other parts of the kernel.
+		 */
+		mutex_lock(&register_lock);
+		list_del_rcu(&dev_map->list);
+		mutex_unlock(&register_lock);
+		synchronize_rcu();
+		goto err_add;
+	}
+
+	return dev_map;
+
+ err_add:
+	kill_dev_map(dev_map);
+	percpu_ref_exit(&dev_map->percpu_ref);
+ err_ref:
+	module_put(mod);
+ err_mod:
+	kfree(dev_map->altmap);
+ err_altmap:
+	kfree(dev_map);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(__register_dev_memmap);
+
+void unregister_dev_memmap(struct dev_map *dev_map)
+{
+	u64 size;
+
+	if (!dev_map)
+		return;
+
+	/* block new references */
+	kill_dev_map(dev_map);
+
+	/* block new lookups */
+	mutex_lock(&register_lock);
+	list_del_rcu(&dev_map->list);
+	mutex_unlock(&register_lock);
+
+	/* flush pending lookups, and wait for pinned ranges */
+	synchronize_rcu();
+	wait_event(dev_map_wq, !test_bit(DEV_MAP_LIVE, &dev_map->flags));
+
+	/* pages are dead and unused, undo the arch mapping */
+	size = dev_map->end - dev_map->base + 1;
+	arch_remove_dev_memory(dev_map->base, size, dev_map->altmap);
+	percpu_ref_exit(&dev_map->percpu_ref);
+	module_put(dev_map->module);
+	kfree(dev_map->altmap);
+	kfree(dev_map);
+}
+EXPORT_SYMBOL_GPL(unregister_dev_memmap);
+
+struct dev_map * __must_check try_pin_devpfn_range(__pfn_t pfn)
+{
+	phys_addr_t addr = __pfn_t_to_phys(pfn);
+	struct dev_map *ret = NULL;
+	struct dev_map *dev_map;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(dev_map, &dev_maps, list) {
+		if (addr >= dev_map->base && addr <= dev_map->end) {
+			if (percpu_ref_tryget_live(&dev_map->percpu_ref))
+				ret = dev_map;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(try_pin_devpfn_range);
+
+void unpin_devpfn_range(struct dev_map *dev_map)
+{
+	if (dev_map)
+		percpu_ref_put(&dev_map->percpu_ref);
+}
+EXPORT_SYMBOL_GPL(unpin_devpfn_range);
+#endif /* ZONE_DEVICE */

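For reference, the intended calling convention for the pin/unpin pair
above is roughly the following sketch (not part of the patch;
do_one_io() is a hypothetical stand-in for whatever consumes the pfn,
e.g. direct I/O or DMA setup):

	/*
	 * Illustrative only: hold a reference on the hosting dev_map
	 * across an I/O so that unregister_dev_memmap() cannot tear
	 * down the memmap while the pfn is in use.
	 */
	static int dev_pfn_io(__pfn_t pfn)
	{
		struct dev_map *dev_map = try_pin_devpfn_range(pfn);
		int rc;

		if (!dev_map)
			return -ENODEV; /* no range, or being unregistered */

		rc = do_one_io(pfn);
		unpin_devpfn_range(dev_map);
		return rc;
	}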

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
                   ` (3 preceding siblings ...)
  2015-08-13  3:50 ` [RFC PATCH 4/7] mm: register_dev_memmap() Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-15  9:06   ` Christoph Hellwig
  2015-08-13  3:50 ` [RFC PATCH 6/7] libnvdimm, pfn: 'struct page' provider infrastructure Dan Williams
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, david, mingo, linux-mm, mgorman,
	ross.zwisler, torvalds, hch

Purely for ease of testing: with this in place we can run the unit tests
alongside any tests that depend on the memmap=ss!nn kernel parameter.
The unit-test mocking implementation requires that libnvdimm be a
module, not built-in.

A nice side effect is that the implementation becomes a bit more
generic, as it no longer depends on <asm/e820.h>.
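
For example (illustrative values), a kernel booted with memmap=4G!12G
to simulate a 4 GiB PMEM range can now load, exercise, and unload
nd_e820.ko repeatedly without a rebuild.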

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig                 |    6 ++-
 arch/x86/include/uapi/asm/e820.h |    2 -
 arch/x86/kernel/Makefile         |    2 -
 arch/x86/kernel/pmem.c           |   79 ++++-------------------------------
 drivers/nvdimm/Makefile          |    3 +
 drivers/nvdimm/e820.c            |   86 ++++++++++++++++++++++++++++++++++++++
 tools/testing/nvdimm/Kbuild      |    4 ++
 7 files changed, 108 insertions(+), 74 deletions(-)
 create mode 100644 drivers/nvdimm/e820.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 64829b17980b..5d6994f62c4d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1439,10 +1439,14 @@ config ILLEGAL_POINTER_VALUE
 
 source "mm/Kconfig"
 
+config X86_PMEM_LEGACY_DEVICE
+	bool
+
 config X86_PMEM_LEGACY
-	bool "Support non-standard NVDIMMs and ADR protected memory"
+	tristate "Support non-standard NVDIMMs and ADR protected memory"
 	depends on PHYS_ADDR_T_64BIT
 	depends on BLK_DEV
+	select X86_PMEM_LEGACY_DEVICE
 	select LIBNVDIMM
 	help
 	  Treat memory marked using the non-standard e820 type of 12 as used
diff --git a/arch/x86/include/uapi/asm/e820.h b/arch/x86/include/uapi/asm/e820.h
index 0f457e6eab18..9dafe59cf6e2 100644
--- a/arch/x86/include/uapi/asm/e820.h
+++ b/arch/x86/include/uapi/asm/e820.h
@@ -37,7 +37,7 @@
 /*
  * This is a non-standardized way to represent ADR or NVDIMM regions that
  * persist over a reboot.  The kernel will ignore their special capabilities
- * unless the CONFIG_X86_PMEM_LEGACY=y option is set.
+ * unless the CONFIG_X86_PMEM_LEGACY option is set.
  *
  * ( Note that older platforms also used 6 for the same type of memory,
  *   but newer versions switched to 12 as 6 was assigned differently.  Some
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0f15af41bd80..ac2bb7e28ba2 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -92,7 +92,7 @@ obj-$(CONFIG_KVM_GUEST)		+= kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)	+= pvclock.o
-obj-$(CONFIG_X86_PMEM_LEGACY)	+= pmem.o
+obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
 
 obj-$(CONFIG_PCSPKR_PLATFORM)	+= pcspeaker.o
 
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
index 64f90f53bb85..4f00b63d7ff3 100644
--- a/arch/x86/kernel/pmem.c
+++ b/arch/x86/kernel/pmem.c
@@ -3,80 +3,17 @@
  * Copyright (c) 2015, Intel Corporation.
  */
 #include <linux/platform_device.h>
-#include <linux/libnvdimm.h>
 #include <linux/module.h>
-#include <asm/e820.h>
-
-static void e820_pmem_release(struct device *dev)
-{
-	struct nvdimm_bus *nvdimm_bus = dev->platform_data;
-
-	if (nvdimm_bus)
-		nvdimm_bus_unregister(nvdimm_bus);
-}
-
-static struct platform_device e820_pmem = {
-	.name = "e820_pmem",
-	.id = -1,
-	.dev = {
-		.release = e820_pmem_release,
-	},
-};
-
-static const struct attribute_group *e820_pmem_attribute_groups[] = {
-	&nvdimm_bus_attribute_group,
-	NULL,
-};
-
-static const struct attribute_group *e820_pmem_region_attribute_groups[] = {
-	&nd_region_attribute_group,
-	&nd_device_attribute_group,
-	NULL,
-};
 
 static __init int register_e820_pmem(void)
 {
-	static struct nvdimm_bus_descriptor nd_desc;
-	struct device *dev = &e820_pmem.dev;
-	struct nvdimm_bus *nvdimm_bus;
-	int rc, i;
-
-	rc = platform_device_register(&e820_pmem);
-	if (rc)
-		return rc;
-
-	nd_desc.attr_groups = e820_pmem_attribute_groups;
-	nd_desc.provider_name = "e820";
-	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
-	if (!nvdimm_bus)
-		goto err;
-	dev->platform_data = nvdimm_bus;
-
-	for (i = 0; i < e820.nr_map; i++) {
-		struct e820entry *ei = &e820.map[i];
-		struct resource res = {
-			.flags	= IORESOURCE_MEM,
-			.start	= ei->addr,
-			.end	= ei->addr + ei->size - 1,
-		};
-		struct nd_region_desc ndr_desc;
-
-		if (ei->type != E820_PRAM)
-			continue;
-
-		memset(&ndr_desc, 0, sizeof(ndr_desc));
-		ndr_desc.res = &res;
-		ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
-		ndr_desc.numa_node = NUMA_NO_NODE;
-		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
-			goto err;
-	}
-
-	return 0;
-
- err:
-	dev_err(dev, "failed to register legacy persistent memory ranges\n");
-	platform_device_unregister(&e820_pmem);
-	return -ENXIO;
+	struct platform_device *pdev;
+
+	/*
+	 * See drivers/nvdimm/e820.c for the implementation; this is
+	 * simply here to trigger the module to load on demand.
+	 */
+	pdev = platform_device_alloc("e820_pmem", -1);
+	if (!pdev)
+		return -ENOMEM;
+	return platform_device_add(pdev);
 }
 device_initcall(register_e820_pmem);
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 594bb97c867a..9bf15db52dee 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
+obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
 
 nd_pmem-y := pmem.o
 
@@ -9,6 +10,8 @@ nd_btt-y := btt.o
 
 nd_blk-y := blk.o
 
+nd_e820-y := e820.o
+
 libnvdimm-y := core.o
 libnvdimm-y += bus.o
 libnvdimm-y += dimm_devs.o
diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c
new file mode 100644
index 000000000000..1b5743ad92db
--- /dev/null
+++ b/drivers/nvdimm/e820.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2015, Christoph Hellwig.
+ * Copyright (c) 2015, Intel Corporation.
+ */
+#include <linux/platform_device.h>
+#include <linux/libnvdimm.h>
+#include <linux/module.h>
+
+static const struct attribute_group *e820_pmem_attribute_groups[] = {
+	&nvdimm_bus_attribute_group,
+	NULL,
+};
+
+static const struct attribute_group *e820_pmem_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
+
+static int e820_pmem_remove(struct platform_device *pdev)
+{
+	struct nvdimm_bus *nvdimm_bus = platform_get_drvdata(pdev);
+
+	nvdimm_bus_unregister(nvdimm_bus);
+	return 0;
+}
+
+static int e820_pmem_probe(struct platform_device *pdev)
+{
+	static struct nvdimm_bus_descriptor nd_desc;
+	struct device *dev = &pdev->dev;
+	struct nvdimm_bus *nvdimm_bus;
+	struct resource *p;
+
+	nd_desc.attr_groups = e820_pmem_attribute_groups;
+	nd_desc.provider_name = "e820";
+	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
+	if (!nvdimm_bus)
+		goto err;
+	platform_set_drvdata(pdev, nvdimm_bus);
+
+	for (p = iomem_resource.child; p ; p = p->sibling) {
+		struct nd_region_desc ndr_desc;
+
+		if (strncmp(p->name, "Persistent Memory (legacy)", 26) != 0)
+			continue;
+
+		memset(&ndr_desc, 0, sizeof(ndr_desc));
+		ndr_desc.res = p;
+		ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+		ndr_desc.numa_node = NUMA_NO_NODE;
+		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+			goto err;
+	}
+
+	return 0;
+
+ err:
+	nvdimm_bus_unregister(nvdimm_bus);
+	dev_err(dev, "failed to register legacy persistent memory ranges\n");
+	return -ENXIO;
+}
+
+static struct platform_driver e820_pmem_driver = {
+	.probe = e820_pmem_probe,
+	.remove = e820_pmem_remove,
+	.driver = {
+		.name = "e820_pmem",
+	},
+};
+
+static __init int e820_pmem_init(void)
+{
+	return platform_driver_register(&e820_pmem_driver);
+}
+
+static __exit void e820_pmem_exit(void)
+{
+	platform_driver_unregister(&e820_pmem_driver);
+}
+
+MODULE_ALIAS("platform:e820_pmem*");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
+module_init(e820_pmem_init);
+module_exit(e820_pmem_exit);
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 04c5fc09576d..e667579d38a0 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -15,6 +15,7 @@ obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
 obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
 obj-$(CONFIG_ND_BTT) += nd_btt.o
 obj-$(CONFIG_ND_BLK) += nd_blk.o
+obj-$(CONFIG_X86_PMEM_LEGACY) += nd_e820.o
 obj-$(CONFIG_ACPI_NFIT) += nfit.o
 
 nfit-y := $(ACPI_SRC)/nfit.o
@@ -29,6 +30,9 @@ nd_btt-y += config_check.o
 nd_blk-y := $(NVDIMM_SRC)/blk.o
 nd_blk-y += config_check.o
 
+nd_e820-y := $(NVDIMM_SRC)/e820.o
+nd_e820-y += config_check.o
+
 libnvdimm-y := $(NVDIMM_SRC)/core.o
 libnvdimm-y += $(NVDIMM_SRC)/bus.o
 libnvdimm-y += $(NVDIMM_SRC)/dimm_devs.o


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 6/7] libnvdimm, pfn: 'struct page' provider infrastructure
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
                   ` (4 preceding siblings ...)
  2015-08-13  3:50 ` [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-13  3:50 ` [RFC PATCH 7/7] libnvdimm, pmem: 'struct page' for pmem Dan Williams
  2015-08-15  9:01 ` [RFC PATCH 0/7] 'struct page' driver for persistent memory Christoph Hellwig
  7 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, david, mingo, linux-mm, mgorman,
	ross.zwisler, torvalds, hch

Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices, they take a namespace as a backing device and layer
functionality on top of it. In this case the functionality is reserving
space for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm device model
for configuring the base PFN device.

As the namespace-claiming mechanism for PFN devices is mostly identical
to that of BTT devices, drivers/nvdimm/claim.c is created to house the
common bits.
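
To give a sense of the reservation size (illustrative numbers): at 64
bytes of 'struct page' per 4 KiB page, a 16 GiB namespace holds
4,194,304 pages, so its memmap needs 4,194,304 * 64 bytes = 256 MiB,
i.e. 1/64th (~1.6%) of the namespace capacity.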

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig          |   22 +++
 drivers/nvdimm/Makefile         |    2 
 drivers/nvdimm/btt.c            |    8 -
 drivers/nvdimm/btt_devs.c       |  172 +-------------------
 drivers/nvdimm/claim.c          |  201 ++++++++++++++++++++++++
 drivers/nvdimm/namespace_devs.c |   34 +++-
 drivers/nvdimm/nd-core.h        |    9 +
 drivers/nvdimm/nd.h             |   59 +++++++
 drivers/nvdimm/pfn.h            |   35 ++++
 drivers/nvdimm/pfn_devs.c       |  333 +++++++++++++++++++++++++++++++++++++++
 drivers/nvdimm/region.c         |    2 
 drivers/nvdimm/region_devs.c    |   19 ++
 tools/testing/nvdimm/Kbuild     |    2 
 13 files changed, 720 insertions(+), 178 deletions(-)
 create mode 100644 drivers/nvdimm/claim.c
 create mode 100644 drivers/nvdimm/pfn.h
 create mode 100644 drivers/nvdimm/pfn_devs.c

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 0d8c6bda7a41..7e05f2657d09 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -22,6 +22,7 @@ config BLK_DEV_PMEM
 	depends on HAS_IOMEM
 	select KMAP_PFN
 	select ND_BTT if BTT
+	select ND_PFN if NVDIMM_PFN
 	help
 	  Memory ranges for PMEM are described by either an NFIT
 	  (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
@@ -48,12 +49,16 @@ config ND_BLK
 	  (CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
 	  capabilities.
 
+config ND_CLAIM
+	bool
+
 config ND_BTT
 	tristate
 
 config BTT
 	bool "BTT: Block Translation Table (atomic sector updates)"
 	default y if LIBNVDIMM
+	select ND_CLAIM
 	help
 	  The Block Translation Table (BTT) provides atomic sector
 	  update semantics for persistent memory devices, so that
@@ -66,4 +71,21 @@ config BTT
 
 	  Select Y if unsure
 
+config ND_PFN
+	tristate
+
+config NVDIMM_PFN
+	bool "PFN: Map persistent (device) memory"
+	default LIBNVDIMM
+	select ND_CLAIM
+	help
+	  Map persistent memory, i.e. advertise it to the memory
+	  management sub-system.  By default persistent memory does
+	  not support direct I/O, RDMA, or any other usage that
+	  requires a 'struct page' to mediate an I/O request.  This
+	  driver allocates and initializes the infrastructure needed
+	  to support those use cases.
+
+	  Select Y if unsure
+
 endif
diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
index 9bf15db52dee..ea84d3c4e8e5 100644
--- a/drivers/nvdimm/Makefile
+++ b/drivers/nvdimm/Makefile
@@ -20,4 +20,6 @@ libnvdimm-y += region_devs.o
 libnvdimm-y += region.o
 libnvdimm-y += namespace_devs.o
 libnvdimm-y += label.o
+libnvdimm-$(CONFIG_ND_CLAIM) += claim.o
 libnvdimm-$(CONFIG_BTT) += btt_devs.o
+libnvdimm-$(CONFIG_NVDIMM_PFN) += pfn_devs.o
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 552f1c4f4dc6..a0db65788a92 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -595,7 +595,7 @@ static int arena_is_valid(struct arena_info *arena, struct btt_sb *super,
 
 	checksum = le64_to_cpu(super->checksum);
 	super->checksum = 0;
-	if (checksum != nd_btt_sb_checksum(super))
+	if (checksum != nd_sb_checksum((struct nd_gen_sb *) super))
 		return 0;
 	super->checksum = cpu_to_le64(checksum);
 
@@ -759,6 +759,7 @@ static int create_arenas(struct btt *btt)
 static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
 {
 	int ret;
+	u64 sum;
 	struct btt_sb *super;
 
 	ret = btt_map_init(arena);
@@ -795,7 +796,8 @@ static int btt_arena_write_layout(struct arena_info *arena, u8 *uuid)
 	super->info2off = cpu_to_le64(arena->info2off - arena->infooff);
 
 	super->flags = 0;
-	super->checksum = cpu_to_le64(nd_btt_sb_checksum(super));
+	sum = nd_sb_checksum((struct nd_gen_sb *) super);
+	super->checksum = cpu_to_le64(sum);
 
 	ret = btt_info_write(arena, super);
 
@@ -1447,8 +1449,6 @@ static int __init nd_btt_init(void)
 {
 	int rc;
 
-	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
-
 	btt_major = register_blkdev(0, "btt");
 	if (btt_major < 0)
 		return btt_major;
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 6ac8c0fea3ec..8256c481762b 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -21,63 +21,13 @@
 #include "btt.h"
 #include "nd.h"
 
-static void __nd_btt_detach_ndns(struct nd_btt *nd_btt)
-{
-	struct nd_namespace_common *ndns = nd_btt->ndns;
-
-	dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
-			|| ndns->claim != &nd_btt->dev,
-			"%s: invalid claim\n", __func__);
-	ndns->claim = NULL;
-	nd_btt->ndns = NULL;
-	put_device(&ndns->dev);
-}
-
-static void nd_btt_detach_ndns(struct nd_btt *nd_btt)
-{
-	struct nd_namespace_common *ndns = nd_btt->ndns;
-
-	if (!ndns)
-		return;
-	get_device(&ndns->dev);
-	device_lock(&ndns->dev);
-	__nd_btt_detach_ndns(nd_btt);
-	device_unlock(&ndns->dev);
-	put_device(&ndns->dev);
-}
-
-static bool __nd_btt_attach_ndns(struct nd_btt *nd_btt,
-		struct nd_namespace_common *ndns)
-{
-	if (ndns->claim)
-		return false;
-	dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
-			|| nd_btt->ndns,
-			"%s: invalid claim\n", __func__);
-	ndns->claim = &nd_btt->dev;
-	nd_btt->ndns = ndns;
-	get_device(&ndns->dev);
-	return true;
-}
-
-static bool nd_btt_attach_ndns(struct nd_btt *nd_btt,
-		struct nd_namespace_common *ndns)
-{
-	bool claimed;
-
-	device_lock(&ndns->dev);
-	claimed = __nd_btt_attach_ndns(nd_btt, ndns);
-	device_unlock(&ndns->dev);
-	return claimed;
-}
-
 static void nd_btt_release(struct device *dev)
 {
 	struct nd_region *nd_region = to_nd_region(dev->parent);
 	struct nd_btt *nd_btt = to_nd_btt(dev);
 
 	dev_dbg(dev, "%s\n", __func__);
-	nd_btt_detach_ndns(nd_btt);
+	nd_detach_ndns(&nd_btt->dev, &nd_btt->ndns);
 	ida_simple_remove(&nd_region->btt_ida, nd_btt->id);
 	kfree(nd_btt->uuid);
 	kfree(nd_btt);
@@ -172,104 +122,15 @@ static ssize_t namespace_show(struct device *dev,
 	return rc;
 }
 
-static int namespace_match(struct device *dev, void *data)
-{
-	char *name = data;
-
-	return strcmp(name, dev_name(dev)) == 0;
-}
-
-static bool is_nd_btt_idle(struct device *dev)
-{
-	struct nd_region *nd_region = to_nd_region(dev->parent);
-	struct nd_btt *nd_btt = to_nd_btt(dev);
-
-	if (nd_region->btt_seed == dev || nd_btt->ndns || dev->driver)
-		return false;
-	return true;
-}
-
-static ssize_t __namespace_store(struct device *dev,
-		struct device_attribute *attr, const char *buf, size_t len)
-{
-	struct nd_btt *nd_btt = to_nd_btt(dev);
-	struct nd_namespace_common *ndns;
-	struct device *found;
-	char *name;
-
-	if (dev->driver) {
-		dev_dbg(dev, "%s: -EBUSY\n", __func__);
-		return -EBUSY;
-	}
-
-	name = kstrndup(buf, len, GFP_KERNEL);
-	if (!name)
-		return -ENOMEM;
-	strim(name);
-
-	if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
-		/* pass */;
-	else {
-		len = -EINVAL;
-		goto out;
-	}
-
-	ndns = nd_btt->ndns;
-	if (strcmp(name, "") == 0) {
-		/* detach the namespace and destroy / reset the btt device */
-		nd_btt_detach_ndns(nd_btt);
-		if (is_nd_btt_idle(dev))
-			nd_device_unregister(dev, ND_ASYNC);
-		else {
-			nd_btt->lbasize = 0;
-			kfree(nd_btt->uuid);
-			nd_btt->uuid = NULL;
-		}
-		goto out;
-	} else if (ndns) {
-		dev_dbg(dev, "namespace already set to: %s\n",
-				dev_name(&ndns->dev));
-		len = -EBUSY;
-		goto out;
-	}
-
-	found = device_find_child(dev->parent, name, namespace_match);
-	if (!found) {
-		dev_dbg(dev, "'%s' not found under %s\n", name,
-				dev_name(dev->parent));
-		len = -ENODEV;
-		goto out;
-	}
-
-	ndns = to_ndns(found);
-	if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
-		dev_dbg(dev, "%s too small to host btt\n", name);
-		len = -ENXIO;
-		goto out_attach;
-	}
-
-	WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
-	if (!nd_btt_attach_ndns(nd_btt, ndns)) {
-		dev_dbg(dev, "%s already claimed\n",
-				dev_name(&ndns->dev));
-		len = -EBUSY;
-	}
-
- out_attach:
-	put_device(&ndns->dev); /* from device_find_child */
- out:
-	kfree(name);
-	return len;
-}
-
 static ssize_t namespace_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t len)
 {
+	struct nd_btt *nd_btt = to_nd_btt(dev);
 	ssize_t rc;
 
 	nvdimm_bus_lock(dev);
 	device_lock(dev);
-	rc = __namespace_store(dev, attr, buf, len);
+	rc = nd_namespace_store(dev, &nd_btt->ndns, buf, len);
 	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
 			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
 	device_unlock(dev);
@@ -324,7 +185,7 @@ static struct device *__nd_btt_create(struct nd_region *nd_region,
 	dev->type = &nd_btt_device_type;
 	dev->groups = nd_btt_attribute_groups;
 	device_initialize(&nd_btt->dev);
-	if (ndns && !__nd_btt_attach_ndns(nd_btt, ndns)) {
+	if (ndns && !__nd_attach_ndns(&nd_btt->dev, ndns, &nd_btt->ndns)) {
 		dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
 				__func__, dev_name(ndns->claim));
 		put_device(dev);
@@ -342,25 +203,6 @@ struct device *nd_btt_create(struct nd_region *nd_region)
 	return dev;
 }
 
-/*
- * nd_btt_sb_checksum: compute checksum for btt info block
- *
- * Returns a fletcher64 checksum of everything in the given info block
- * except the last field (since that's where the checksum lives).
- */
-u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
-{
-	u64 sum;
-	__le64 sum_save;
-
-	sum_save = btt_sb->checksum;
-	btt_sb->checksum = 0;
-	sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
-	btt_sb->checksum = sum_save;
-	return sum;
-}
-EXPORT_SYMBOL(nd_btt_sb_checksum);
-
 static int __nd_btt_probe(struct nd_btt *nd_btt,
 		struct nd_namespace_common *ndns, struct btt_sb *btt_sb)
 {
@@ -380,7 +222,7 @@ static int __nd_btt_probe(struct nd_btt *nd_btt,
 
 	checksum = le64_to_cpu(btt_sb->checksum);
 	btt_sb->checksum = 0;
-	if (checksum != nd_btt_sb_checksum(btt_sb))
+	if (checksum != nd_sb_checksum((struct nd_gen_sb *) btt_sb))
 		return -ENODEV;
 	btt_sb->checksum = cpu_to_le64(checksum);
 
@@ -416,7 +258,9 @@ int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
 	dev_dbg(&ndns->dev, "%s: btt: %s\n", __func__,
 			rc == 0 ? dev_name(dev) : "<none>");
 	if (rc < 0) {
-		__nd_btt_detach_ndns(to_nd_btt(dev));
+		struct nd_btt *nd_btt = to_nd_btt(dev);
+
+		__nd_detach_ndns(dev, &nd_btt->ndns);
 		put_device(dev);
 	}
 
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
new file mode 100644
index 000000000000..e8f03b0e95e4
--- /dev/null
+++ b/drivers/nvdimm/claim.c
@@ -0,0 +1,201 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/device.h>
+#include <linux/sizes.h>
+#include "nd-core.h"
+#include "pfn.h"
+#include "btt.h"
+#include "nd.h"
+
+void __nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns)
+{
+	struct nd_namespace_common *ndns = *_ndns;
+
+	dev_WARN_ONCE(dev, !mutex_is_locked(&ndns->dev.mutex)
+			|| ndns->claim != dev,
+			"%s: invalid claim\n", __func__);
+	ndns->claim = NULL;
+	*_ndns = NULL;
+	put_device(&ndns->dev);
+}
+
+void nd_detach_ndns(struct device *dev,
+		struct nd_namespace_common **_ndns)
+{
+	struct nd_namespace_common *ndns = *_ndns;
+
+	if (!ndns)
+		return;
+	get_device(&ndns->dev);
+	device_lock(&ndns->dev);
+	__nd_detach_ndns(dev, _ndns);
+	device_unlock(&ndns->dev);
+	put_device(&ndns->dev);
+}
+
+bool __nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+		struct nd_namespace_common **_ndns)
+{
+	if (attach->claim)
+		return false;
+	dev_WARN_ONCE(dev, !mutex_is_locked(&attach->dev.mutex)
+			|| *_ndns,
+			"%s: invalid claim\n", __func__);
+	attach->claim = dev;
+	*_ndns = attach;
+	get_device(&attach->dev);
+	return true;
+}
+
+bool nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+		struct nd_namespace_common **_ndns)
+{
+	bool claimed;
+
+	device_lock(&attach->dev);
+	claimed = __nd_attach_ndns(dev, attach, _ndns);
+	device_unlock(&attach->dev);
+	return claimed;
+}
+
+static int namespace_match(struct device *dev, void *data)
+{
+	char *name = data;
+
+	return strcmp(name, dev_name(dev)) == 0;
+}
+
+static bool is_idle(struct device *dev, struct nd_namespace_common *ndns)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct device *seed = NULL;
+
+	if (is_nd_btt(dev))
+		seed = nd_region->btt_seed;
+	else if (is_nd_pfn(dev))
+		seed = nd_region->pfn_seed;
+
+	if (seed == dev || ndns || dev->driver)
+		return false;
+	return true;
+}
+
+static void nd_detach_and_reset(struct device *dev,
+		struct nd_namespace_common **_ndns)
+{
+	/* detach the namespace and destroy / reset the device */
+	nd_detach_ndns(dev, _ndns);
+	if (is_idle(dev, *_ndns)) {
+		nd_device_unregister(dev, ND_ASYNC);
+	} else if (is_nd_btt(dev)) {
+		struct nd_btt *nd_btt = to_nd_btt(dev);
+
+		nd_btt->lbasize = 0;
+		kfree(nd_btt->uuid);
+		nd_btt->uuid = NULL;
+	} else if (is_nd_pfn(dev)) {
+		struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+		kfree(nd_pfn->uuid);
+		nd_pfn->uuid = NULL;
+		nd_pfn->mode = PFN_MODE_NONE;
+	}
+}
+
+ssize_t nd_namespace_store(struct device *dev,
+		struct nd_namespace_common **_ndns, const char *buf,
+		size_t len)
+{
+	struct nd_namespace_common *ndns;
+	struct device *found;
+	char *name;
+
+	if (dev->driver) {
+		dev_dbg(dev, "%s: -EBUSY\n", __func__);
+		return -EBUSY;
+	}
+
+	name = kstrndup(buf, len, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+	strim(name);
+
+	if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
+		/* pass */;
+	else {
+		len = -EINVAL;
+		goto out;
+	}
+
+	ndns = *_ndns;
+	if (strcmp(name, "") == 0) {
+		nd_detach_and_reset(dev, _ndns);
+		goto out;
+	} else if (ndns) {
+		dev_dbg(dev, "namespace already set to: %s\n",
+				dev_name(&ndns->dev));
+		len = -EBUSY;
+		goto out;
+	}
+
+	found = device_find_child(dev->parent, name, namespace_match);
+	if (!found) {
+		dev_dbg(dev, "'%s' not found under %s\n", name,
+				dev_name(dev->parent));
+		len = -ENODEV;
+		goto out;
+	}
+
+	ndns = to_ndns(found);
+	if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
+		dev_dbg(dev, "%s too small to host\n", name);
+		len = -ENXIO;
+		goto out_attach;
+	}
+
+	WARN_ON_ONCE(!is_nvdimm_bus_locked(dev));
+	if (!nd_attach_ndns(dev, ndns, _ndns)) {
+		dev_dbg(dev, "%s already claimed\n",
+				dev_name(&ndns->dev));
+		len = -EBUSY;
+	}
+
+ out_attach:
+	put_device(&ndns->dev); /* from device_find_child */
+ out:
+	kfree(name);
+	return len;
+}
+
+/*
+ * nd_sb_checksum: compute checksum for a generic info block
+ *
+ * Returns a fletcher64 checksum of everything in the given info block
+ * except the last field (since that's where the checksum lives).
+ */
+u64 nd_sb_checksum(struct nd_gen_sb *nd_gen_sb)
+{
+	u64 sum;
+	__le64 sum_save;
+
+	BUILD_BUG_ON(sizeof(struct btt_sb) != SZ_4K);
+	BUILD_BUG_ON(sizeof(struct nd_pfn_sb) != SZ_4K);
+	BUILD_BUG_ON(sizeof(struct nd_gen_sb) != SZ_4K);
+
+	sum_save = nd_gen_sb->checksum;
+	nd_gen_sb->checksum = 0;
+	sum = nd_fletcher64(nd_gen_sb, sizeof(*nd_gen_sb), 1);
+	nd_gen_sb->checksum = sum_save;
+	return sum;
+}
+EXPORT_SYMBOL(nd_sb_checksum);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index fef0dd80d4ad..74e128eccbed 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -82,8 +82,16 @@ const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
 	struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
 	const char *suffix = "";
 
-	if (ndns->claim && is_nd_btt(ndns->claim))
-		suffix = "s";
+	if (ndns->claim) {
+		if (is_nd_btt(ndns->claim))
+			suffix = "s";
+		else if (is_nd_pfn(ndns->claim))
+			suffix = "m";
+		else
+			dev_WARN_ONCE(&ndns->dev, 1,
+					"unknown claim type by %s\n",
+					dev_name(ndns->claim));
+	}
 
 	if (is_namespace_pmem(&ndns->dev) || is_namespace_io(&ndns->dev))
 		sprintf(name, "pmem%d%s", nd_region->id, suffix);
@@ -1235,12 +1243,22 @@ static const struct attribute_group *nd_namespace_attribute_groups[] = {
 struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
 {
 	struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
+	struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
 	struct nd_namespace_common *ndns;
 	resource_size_t size;
 
-	if (nd_btt) {
-		ndns = nd_btt->ndns;
-		if (!ndns)
+	if (nd_btt || nd_pfn) {
+		struct device *host = NULL;
+
+		if (nd_btt) {
+			host = &nd_btt->dev;
+			ndns = nd_btt->ndns;
+		} else if (nd_pfn) {
+			host = &nd_pfn->dev;
+			ndns = nd_pfn->ndns;
+		}
+
+		if (!ndns || !host)
 			return ERR_PTR(-ENODEV);
 
 		/*
@@ -1251,12 +1269,12 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
 		device_unlock(&ndns->dev);
 		if (ndns->dev.driver) {
 			dev_dbg(&ndns->dev, "is active, can't bind %s\n",
-					dev_name(&nd_btt->dev));
+					dev_name(host));
 			return ERR_PTR(-EBUSY);
 		}
-		if (dev_WARN_ONCE(&ndns->dev, ndns->claim != &nd_btt->dev,
+		if (dev_WARN_ONCE(&ndns->dev, ndns->claim != host,
 					"host (%s) vs claim (%s) mismatch\n",
-					dev_name(&nd_btt->dev),
+					dev_name(host),
 					dev_name(ndns->claim)))
 			return ERR_PTR(-ENXIO);
 	} else {
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index e1970c71ad1c..159aed532042 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -80,4 +80,13 @@ struct resource *nsblk_add_resource(struct nd_region *nd_region,
 int nvdimm_num_label_slots(struct nvdimm_drvdata *ndd);
 void get_ndd(struct nvdimm_drvdata *ndd);
 resource_size_t __nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
+void nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns);
+void __nd_detach_ndns(struct device *dev, struct nd_namespace_common **_ndns);
+bool nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+		struct nd_namespace_common **_ndns);
+bool __nd_attach_ndns(struct device *dev, struct nd_namespace_common *attach,
+		struct nd_namespace_common **_ndns);
+ssize_t nd_namespace_store(struct device *dev,
+		struct nd_namespace_common **_ndns, const char *buf,
+		size_t len);
 #endif /* __ND_CORE_H__ */
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 835263e47bb8..372334835a13 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -29,6 +29,8 @@ enum {
 	ND_MAX_LANES = 256,
 	SECTOR_SHIFT = 9,
 	INT_LBASIZE_ALIGNMENT = 64,
+	ND_PFN_ALIGN = PAGES_PER_SECTION * PAGE_SIZE,
+	ND_PFN_MASK = ND_PFN_ALIGN - 1,
 };
 
 struct nvdimm_drvdata {
@@ -92,8 +94,10 @@ struct nd_region {
 	struct device dev;
 	struct ida ns_ida;
 	struct ida btt_ida;
+	struct ida pfn_ida;
 	struct device *ns_seed;
 	struct device *btt_seed;
+	struct device *pfn_seed;
 	u16 ndr_mappings;
 	u64 ndr_size;
 	u64 ndr_start;
@@ -133,6 +137,24 @@ struct nd_btt {
 	int id;
 };
 
+enum nd_pfn_mode {
+	PFN_MODE_NONE,
+	PFN_MODE_RAM,
+	PFN_MODE_PMEM,
+};
+
+struct dev_map;
+struct nd_pfn {
+	int id;
+	u8 *uuid;
+	struct device dev;
+	struct dev_map *dev_map;
+	unsigned long npfns;
+	enum nd_pfn_mode mode;
+	struct nd_pfn_sb *pfn_sb;
+	struct nd_namespace_common *ndns;
+};
+
 enum nd_async_mode {
 	ND_SYNC,
 	ND_ASYNC,
@@ -159,8 +181,13 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd);
 int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 		void *buf, size_t len);
 struct nd_btt *to_nd_btt(struct device *dev);
-struct btt_sb;
-u64 nd_btt_sb_checksum(struct btt_sb *btt_sb);
+
+struct nd_gen_sb {
+	char reserved[SZ_4K - 8];
+	__le64 checksum;
+};
+
+u64 nd_sb_checksum(struct nd_gen_sb *sb);
 #if IS_ENABLED(CONFIG_BTT)
 int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata);
 bool is_nd_btt(struct device *dev);
@@ -180,8 +207,36 @@ static inline struct device *nd_btt_create(struct nd_region *nd_region)
 {
 	return NULL;
 }
+#endif
+
+struct nd_pfn *to_nd_pfn(struct device *dev);
+#if IS_ENABLED(CONFIG_NVDIMM_PFN)
+int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata);
+bool is_nd_pfn(struct device *dev);
+struct device *nd_pfn_create(struct nd_region *nd_region);
+int nd_pfn_validate(struct nd_pfn *nd_pfn);
+#else
+static inline int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+	return -ENODEV;
+}
+
+static inline bool is_nd_pfn(struct device *dev)
+{
+	return false;
+}
+
+static inline struct device *nd_pfn_create(struct nd_region *nd_region)
+{
+	return NULL;
+}
 
+static inline int nd_pfn_validate(struct nd_pfn *nd_pfn)
+{
+	return -ENXIO;
+}
 #endif
+
 struct nd_region *to_nd_region(struct device *dev);
 int nd_region_to_nstype(struct nd_region *nd_region);
 int nd_region_register_namespaces(struct nd_region *nd_region, int *err);
diff --git a/drivers/nvdimm/pfn.h b/drivers/nvdimm/pfn.h
new file mode 100644
index 000000000000..cc243754acef
--- /dev/null
+++ b/drivers/nvdimm/pfn.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2014-2015, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef __NVDIMM_PFN_H
+#define __NVDIMM_PFN_H
+
+#include <linux/types.h>
+
+#define PFN_SIG_LEN 16
+#define PFN_SIG "NVDIMM_PFN_INFO\0"
+
+struct nd_pfn_sb {
+	u8 signature[PFN_SIG_LEN];
+	u8 uuid[16];
+	u8 parent_uuid[16];
+	__le32 flags;
+	__le16 version_major;
+	__le16 version_minor;
+	__le64 dataoff;
+	__le64 npfns;
+	__le32 mode;
+	u8 padding[4012];
+	__le64 checksum;
+};
+#endif /* __NVDIMM_PFN_H */
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
new file mode 100644
index 000000000000..41824b3bf0d2
--- /dev/null
+++ b/drivers/nvdimm/pfn_devs.c
@@ -0,0 +1,333 @@
+/*
+ * Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/blkdev.h>
+#include <linux/device.h>
+#include <linux/genhd.h>
+#include <linux/sizes.h>
+#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include "nd-core.h"
+#include "pfn.h"
+#include "nd.h"
+
+static void nd_pfn_release(struct device *dev)
+{
+	struct nd_region *nd_region = to_nd_region(dev->parent);
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+	nd_detach_ndns(&nd_pfn->dev, &nd_pfn->ndns);
+	ida_simple_remove(&nd_region->pfn_ida, nd_pfn->id);
+	kfree(nd_pfn->uuid);
+	kfree(nd_pfn);
+}
+
+static struct device_type nd_pfn_device_type = {
+	.name = "nd_pfn",
+	.release = nd_pfn_release,
+};
+
+bool is_nd_pfn(struct device *dev)
+{
+	return dev ? dev->type == &nd_pfn_device_type : false;
+}
+EXPORT_SYMBOL(is_nd_pfn);
+
+struct nd_pfn *to_nd_pfn(struct device *dev)
+{
+	struct nd_pfn *nd_pfn = container_of(dev, struct nd_pfn, dev);
+
+	WARN_ON(!is_nd_pfn(dev));
+	return nd_pfn;
+}
+EXPORT_SYMBOL(to_nd_pfn);
+
+static ssize_t mode_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+	switch (nd_pfn->mode) {
+	case PFN_MODE_RAM:
+		return sprintf(buf, "ram\n");
+	case PFN_MODE_PMEM:
+		return sprintf(buf, "pmem\n");
+	default:
+		return sprintf(buf, "none\n");
+	}
+}
+
+static ssize_t mode_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+	ssize_t rc = 0;
+
+	device_lock(dev);
+	nvdimm_bus_lock(dev);
+	if (dev->driver)
+		rc = -EBUSY;
+	else {
+		size_t n = len - 1;
+
+		if (strncmp(buf, "pmem\n", n) == 0
+				|| strncmp(buf, "pmem", n) == 0)
+			nd_pfn->mode = PFN_MODE_PMEM;
+		else if (strncmp(buf, "ram\n", n) == 0
+				|| strncmp(buf, "ram", n) == 0)
+			nd_pfn->mode = PFN_MODE_RAM;
+		else if (strncmp(buf, "none\n", n) == 0
+				|| strncmp(buf, "none", n) == 0)
+			nd_pfn->mode = PFN_MODE_NONE;
+		else
+			rc = -EINVAL;
+	}
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	nvdimm_bus_unlock(dev);
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(mode);
+
+static ssize_t uuid_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+
+	if (nd_pfn->uuid)
+		return sprintf(buf, "%pUb\n", nd_pfn->uuid);
+	return sprintf(buf, "\n");
+}
+
+static ssize_t uuid_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+	ssize_t rc;
+
+	device_lock(dev);
+	rc = nd_uuid_store(dev, &nd_pfn->uuid, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+
+	return rc ? rc : len;
+}
+static DEVICE_ATTR_RW(uuid);
+
+static ssize_t namespace_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	rc = sprintf(buf, "%s\n", nd_pfn->ndns
+			? dev_name(&nd_pfn->ndns->dev) : "");
+	nvdimm_bus_unlock(dev);
+	return rc;
+}
+
+static ssize_t namespace_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(dev);
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	device_lock(dev);
+	rc = nd_namespace_store(dev, &nd_pfn->ndns, buf, len);
+	dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
+			rc, buf, buf[len - 1] == '\n' ? "" : "\n");
+	device_unlock(dev);
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(namespace);
+
+static struct attribute *nd_pfn_attributes[] = {
+	&dev_attr_mode.attr,
+	&dev_attr_namespace.attr,
+	&dev_attr_uuid.attr,
+	NULL,
+};
+
+static struct attribute_group nd_pfn_attribute_group = {
+	.attrs = nd_pfn_attributes,
+};
+
+static const struct attribute_group *nd_pfn_attribute_groups[] = {
+	&nd_pfn_attribute_group,
+	&nd_device_attribute_group,
+	&nd_numa_attribute_group,
+	NULL,
+};
+
+static struct device *__nd_pfn_create(struct nd_region *nd_region,
+		u8 *uuid, enum nd_pfn_mode mode,
+		struct nd_namespace_common *ndns)
+{
+	struct nd_pfn *nd_pfn;
+	struct device *dev;
+
+	/* we can only create pages for contiguous ranges of pmem */
+	if (!is_nd_pmem(&nd_region->dev))
+		return NULL;
+
+	nd_pfn = kzalloc(sizeof(*nd_pfn), GFP_KERNEL);
+	if (!nd_pfn)
+		return NULL;
+
+	nd_pfn->id = ida_simple_get(&nd_region->pfn_ida, 0, 0, GFP_KERNEL);
+	if (nd_pfn->id < 0) {
+		kfree(nd_pfn);
+		return NULL;
+	}
+
+	nd_pfn->mode = mode;
+	if (uuid)
+		uuid = kmemdup(uuid, 16, GFP_KERNEL);
+	nd_pfn->uuid = uuid;
+	dev = &nd_pfn->dev;
+	dev_set_name(dev, "pfn%d.%d", nd_region->id, nd_pfn->id);
+	dev->parent = &nd_region->dev;
+	dev->type = &nd_pfn_device_type;
+	dev->groups = nd_pfn_attribute_groups;
+	device_initialize(&nd_pfn->dev);
+	if (ndns && !__nd_attach_ndns(&nd_pfn->dev, ndns, &nd_pfn->ndns)) {
+		dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
+				__func__, dev_name(ndns->claim));
+		put_device(dev);
+		return NULL;
+	}
+	return dev;
+}
+
+struct device *nd_pfn_create(struct nd_region *nd_region)
+{
+	struct device *dev = __nd_pfn_create(nd_region, NULL, PFN_MODE_NONE,
+			NULL);
+
+	if (dev)
+		__nd_device_register(dev);
+	return dev;
+}
+
+int nd_pfn_validate(struct nd_pfn *nd_pfn)
+{
+	struct nd_namespace_common *ndns = nd_pfn->ndns;
+	struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
+	struct nd_namespace_io *nsio;
+	u64 checksum, offset;
+
+	if (!pfn_sb || !ndns)
+		return -ENODEV;
+
+	if (!is_nd_pmem(nd_pfn->dev.parent))
+		return -ENODEV;
+
+	/* section alignment for simple hotplug */
+	if (nvdimm_namespace_capacity(ndns) < ND_PFN_ALIGN)
+		return -ENODEV;
+
+	if (nvdimm_read_bytes(ndns, SZ_4K, pfn_sb, sizeof(*pfn_sb)))
+		return -ENXIO;
+
+	if (memcmp(pfn_sb->signature, PFN_SIG, PFN_SIG_LEN) != 0)
+		return -ENODEV;
+
+	checksum = le64_to_cpu(pfn_sb->checksum);
+	pfn_sb->checksum = 0;
+	if (checksum != nd_sb_checksum((struct nd_gen_sb *) pfn_sb))
+		return -ENODEV;
+	pfn_sb->checksum = cpu_to_le64(checksum);
+
+	switch (le32_to_cpu(pfn_sb->mode)) {
+	case PFN_MODE_RAM:
+	case PFN_MODE_PMEM:
+		break;
+	default:
+		return -ENXIO;
+	}
+
+	if (!nd_pfn->uuid) {
+		/* from probe we allocate */
+		nd_pfn->uuid = kmemdup(pfn_sb->uuid, 16, GFP_KERNEL);
+		if (!nd_pfn->uuid)
+			return -ENOMEM;
+	} else {
+		/* from init we validate */
+		if (memcmp(nd_pfn->uuid, pfn_sb->uuid, 16) != 0)
+			return -EINVAL;
+	}
+
+	/*
+	 * These warnings are verbose because they can only trigger in
+	 * the case where the physical address alignment of the
+	 * namespace has changed since the pfn superblock was
+	 * established.
+	 */
+	offset = le64_to_cpu(pfn_sb->dataoff);
+	nsio = to_nd_namespace_io(&ndns->dev);
+	if ((nsio->res.start + offset) & (ND_PFN_ALIGN - 1)) {
+		dev_err(&nd_pfn->dev,
+				"init failed: %s with offset %#llx not section aligned\n",
+				dev_name(&ndns->dev), offset);
+		return -EBUSY;
+	} else if (offset >= resource_size(&nsio->res)) {
+		dev_err(&nd_pfn->dev, "pfn array size exceeds capacity of %s\n",
+				dev_name(&ndns->dev));
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
+int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
+{
+	int rc;
+	struct device *dev;
+	struct nd_pfn *nd_pfn;
+	struct nd_pfn_sb *pfn_sb;
+	struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
+
+	if (ndns->force_raw)
+		return -ENODEV;
+
+	nvdimm_bus_lock(&ndns->dev);
+	dev = __nd_pfn_create(nd_region, NULL, PFN_MODE_NONE, ndns);
+	nvdimm_bus_unlock(&ndns->dev);
+	if (!dev)
+		return -ENOMEM;
+	dev_set_drvdata(dev, drvdata);
+	pfn_sb = kzalloc(sizeof(*pfn_sb), GFP_KERNEL);
+	nd_pfn = to_nd_pfn(dev);
+	nd_pfn->pfn_sb = pfn_sb;
+	rc = nd_pfn_validate(nd_pfn);
+	nd_pfn->pfn_sb = NULL;
+	kfree(pfn_sb);
+	dev_dbg(&ndns->dev, "%s: pfn: %s\n", __func__,
+			rc == 0 ? dev_name(dev) : "<none>");
+	if (rc < 0) {
+		__nd_detach_ndns(dev, &nd_pfn->ndns);
+		put_device(dev);
+	} else
+		__nd_device_register(&nd_pfn->dev);
+
+	return rc;
+}
+EXPORT_SYMBOL(nd_pfn_probe);
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index f28f78ccff19..7da63eac78ee 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -53,6 +53,7 @@ static int nd_region_probe(struct device *dev)
 		return -ENODEV;
 
 	nd_region->btt_seed = nd_btt_create(nd_region);
+	nd_region->pfn_seed = nd_pfn_create(nd_region);
 	if (err == 0)
 		return 0;
 
@@ -84,6 +85,7 @@ static int nd_region_remove(struct device *dev)
 	nvdimm_bus_lock(dev);
 	nd_region->ns_seed = NULL;
 	nd_region->btt_seed = NULL;
+	nd_region->pfn_seed = NULL;
 	dev_set_drvdata(dev, NULL);
 	nvdimm_bus_unlock(dev);
 
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 7384455792bf..da4338154ad2 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -345,6 +345,23 @@ static ssize_t btt_seed_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(btt_seed);
 
+static ssize_t pfn_seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct nd_region *nd_region = to_nd_region(dev);
+	ssize_t rc;
+
+	nvdimm_bus_lock(dev);
+	if (nd_region->pfn_seed)
+		rc = sprintf(buf, "%s\n", dev_name(nd_region->pfn_seed));
+	else
+		rc = sprintf(buf, "\n");
+	nvdimm_bus_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(pfn_seed);
+
 static ssize_t read_only_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -373,6 +390,7 @@ static struct attribute *nd_region_attributes[] = {
 	&dev_attr_nstype.attr,
 	&dev_attr_mappings.attr,
 	&dev_attr_btt_seed.attr,
+	&dev_attr_pfn_seed.attr,
 	&dev_attr_read_only.attr,
 	&dev_attr_set_cookie.attr,
 	&dev_attr_available_size.attr,
@@ -744,6 +762,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->numa_node = ndr_desc->numa_node;
 	ida_init(&nd_region->ns_ida);
 	ida_init(&nd_region->btt_ida);
+	ida_init(&nd_region->pfn_ida);
 	dev = &nd_region->dev;
 	dev_set_name(dev, "region%d", nd_region->id);
 	dev->parent = &nvdimm_bus->dev;
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index e667579d38a0..22d4d19a49bc 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -41,7 +41,9 @@ libnvdimm-y += $(NVDIMM_SRC)/region_devs.o
 libnvdimm-y += $(NVDIMM_SRC)/region.o
 libnvdimm-y += $(NVDIMM_SRC)/namespace_devs.o
 libnvdimm-y += $(NVDIMM_SRC)/label.o
+libnvdimm-$(CONFIG_ND_CLAIM) += $(NVDIMM_SRC)/claim.o
 libnvdimm-$(CONFIG_BTT) += $(NVDIMM_SRC)/btt_devs.o
+libnvdimm-$(CONFIG_NVDIMM_PFN) += $(NVDIMM_SRC)/pfn_devs.o
 libnvdimm-y += config_check.o
 
 obj-m += test/


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH 7/7] libnvdimm, pmem: 'struct page' for pmem
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
                   ` (5 preceding siblings ...)
  2015-08-13  3:50 ` [RFC PATCH 6/7] libnvdimm, pfn: 'struct page' provider infrastructure Dan Williams
@ 2015-08-13  3:50 ` Dan Williams
  2015-08-15  9:01 ` [RFC PATCH 0/7] 'struct page' driver for persistent memory Christoph Hellwig
  7 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-13  3:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: boaz, riel, linux-nvdimm, david, mingo, linux-mm, mgorman,
	ross.zwisler, torvalds, hch

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation can
come either from RAM (when the mapped PMEM capacity is small) or from
PMEM (when write-endurance and relative-performance concerns for PMEM
are low).

Given that this adds a new call to devm_memunmap(), the corresponding
wrapper is added to tools/testing/nvdimm/.
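
As a worked example of the layout math in nd_pfn_init() below
(illustrative 8 GiB namespace in PFN_MODE_PMEM): 8 KiB is reserved for
the info block, the first-pass page count is (8 GiB - 8 KiB) / 4 KiB =
2,097,150, the raw reservation is therefore 8 KiB + 64 * 2,097,150
bytes (~128 MiB), and the ALIGN(..., PMD_SIZE) round-up places the data
offset at exactly 130 MiB, leaving ~98.4% of the namespace for data.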

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/nvdimm/Kconfig            |    4 +
 drivers/nvdimm/pfn_devs.c         |    1 +
 drivers/nvdimm/pmem.c             |  213 ++++++++++++++++++++++++++++++++++++-
 tools/testing/nvdimm/Kbuild       |    1 
 tools/testing/nvdimm/test/iomap.c |   13 ++
 5 files changed, 224 insertions(+), 8 deletions(-)

diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 7e05f2657d09..87fbd693c68e 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -77,6 +77,10 @@ config ND_PFN
 config NVDIMM_PFN
 	bool "PFN: Map persistent (device) memory"
 	default LIBNVDIMM
+	depends on MEMORY_HOTPLUG
+	depends on MEMORY_HOTREMOVE
+	depends on X86_64 #arch_add_memory() comprehends device memory
+	depends on ZONE_DEVICE
 	select ND_CLAIM
 	help
 	  Map persistent memory, i.e. advertise it to the memory
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 41824b3bf0d2..b1319ed53642 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -296,6 +296,7 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn)
 
 	return 0;
 }
+EXPORT_SYMBOL(nd_pfn_validate);
 
 int nd_pfn_probe(struct nd_namespace_common *ndns, void *drvdata)
 {
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 85d4101bb821..51867759b3f3 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -22,19 +22,24 @@
 #include <linux/platform_device.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
+#include <linux/vmalloc.h>
 #include <linux/slab.h>
 #include <linux/pmem.h>
 #include <linux/nd.h>
 #include <linux/mm.h>
 #include <linux/kmap_pfn.h>
+#include "pfn.h"
 #include "nd.h"
 
 struct pmem_device {
 	struct request_queue	*pmem_queue;
 	struct gendisk		*pmem_disk;
+	struct nd_namespace_common *ndns;
 
 	/* One contiguous memory region per device */
 	phys_addr_t		phys_addr;
+	/* when non-zero this device is hosting a 'pfn' instance */
+	phys_addr_t		data_offset;
 	void __pmem		*virt_addr;
 	size_t			size;
 };
@@ -46,7 +51,7 @@ static void pmem_do_bvec(struct pmem_device *pmem, struct page *page,
 			sector_t sector)
 {
 	void *mem = kmap_atomic(page);
-	size_t pmem_off = sector << 9;
+	phys_addr_t pmem_off = sector * 512 + pmem->data_offset;
 	void __pmem *pmem_addr = pmem->virt_addr + pmem_off;
 
 	if (rw == READ) {
@@ -97,10 +102,23 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector,
 		__pfn_t *pfn)
 {
 	struct pmem_device *pmem = bdev->bd_disk->private_data;
-	size_t offset = sector << 9;
-
-	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, PFN_DEV);
-	return pmem->size - offset;
+	resource_size_t offset = sector * 512 + pmem->data_offset;
+	unsigned long flags = PFN_DEV;
+	resource_size_t size;
+
+	if (pmem->data_offset) {
+		flags |= PFN_MAP;
+		/*
+		 * Limit the direct_access() size to what is covered by
+		 * the memmap
+		 */
+		size = (pmem->size - offset) & ~ND_PFN_MASK;
+	} else
+		size = pmem->size - offset;
+
+	*pfn = phys_to_pfn_t(pmem->phys_addr + offset, flags);
+
+	return size;
 }
 
 static const struct block_device_operations pmem_fops = {
@@ -140,6 +158,9 @@ static struct pmem_device *pmem_alloc(struct device *dev,
 
 static void pmem_detach_disk(struct pmem_device *pmem)
 {
+	if (!pmem->pmem_disk)
+		return;
+
 	del_gendisk(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
 	blk_cleanup_queue(pmem->pmem_queue);
@@ -181,7 +202,7 @@ static int pmem_attach_disk(struct device *dev,
 	disk->flags		= GENHD_FL_EXT_DEVT;
 	nvdimm_namespace_disk_name(ndns, disk->disk_name);
 	disk->driverfs_dev = &ndns->dev;
-	set_capacity(disk, pmem->size >> 9);
+	set_capacity(disk, (pmem->size - pmem->data_offset) / 512);
 	pmem->pmem_disk = disk;
 
 	add_disk(disk);
@@ -210,6 +231,170 @@ static int pmem_rw_bytes(struct nd_namespace_common *ndns,
 	return 0;
 }
 
+static int nd_pfn_init(struct nd_pfn *nd_pfn)
+{
+	struct nd_pfn_sb *pfn_sb = kzalloc(sizeof(*pfn_sb), GFP_KERNEL);
+	struct pmem_device *pmem = dev_get_drvdata(&nd_pfn->dev);
+	struct nd_namespace_common *ndns = nd_pfn->ndns;
+	struct nd_region *nd_region;
+	unsigned long npfns;
+	phys_addr_t offset;
+	u64 checksum;
+	int rc;
+
+	if (!pfn_sb)
+		return -ENOMEM;
+
+	nd_pfn->pfn_sb = pfn_sb;
+	rc = nd_pfn_validate(nd_pfn);
+	if (rc == 0 || rc == -EBUSY)
+		return rc;
+
+	/* section alignment for simple hotplug */
+	if (nvdimm_namespace_capacity(ndns) < ND_PFN_ALIGN
+			|| pmem->phys_addr & ND_PFN_MASK)
+		return -ENODEV;
+
+	nd_region = to_nd_region(nd_pfn->dev.parent);
+	if (nd_region->ro) {
+		dev_info(&nd_pfn->dev,
+				"%s is read-only, unable to init metadata\n",
+				dev_name(&nd_region->dev));
+		goto err;
+	}
+
+	memset(pfn_sb, 0, sizeof(*pfn_sb));
+	npfns = (pmem->size - SZ_8K) / SZ_4K;
+	/*
+	 * Note, we use 64 here for the standard size of struct page,
+	 * debugging options may cause it to be larger in which case the
+	 * implementation will limit the pfns advertised through
+	 * ->direct_access() to those that are included in the memmap.
+	 */
+	if (nd_pfn->mode == PFN_MODE_PMEM)
+		offset = SZ_8K + 64 * npfns;
+	else if (nd_pfn->mode == PFN_MODE_RAM)
+		offset = SZ_8K;
+	else
+		goto err;
+
+	offset = ALIGN(offset, PMD_SIZE);
+	npfns = (pmem->size - offset) / SZ_4K;
+	pfn_sb->mode = cpu_to_le32(nd_pfn->mode);
+	pfn_sb->dataoff = cpu_to_le64(offset);
+	pfn_sb->npfns = cpu_to_le64(npfns);
+	memcpy(pfn_sb->signature, PFN_SIG, PFN_SIG_LEN);
+	memcpy(pfn_sb->uuid, nd_pfn->uuid, 16);
+	pfn_sb->version_major = cpu_to_le16(1);
+	checksum = nd_sb_checksum((struct nd_gen_sb *) pfn_sb);
+	pfn_sb->checksum = cpu_to_le64(checksum);
+
+	return 0;
+ err:
+	nd_pfn->pfn_sb = NULL;
+	kfree(pfn_sb);
+	return -ENXIO;
+}
+
+static int nvdimm_namespace_detach_pfn(struct nd_namespace_common *ndns)
+{
+	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
+	struct pmem_device *pmem;
+
+	/* free pmem disk */
+	pmem = dev_get_drvdata(&nd_pfn->dev);
+	pmem_detach_disk(pmem);
+
+	/* tear down pfn range / wait for per-cpu refcount to drop */
+	unregister_dev_memmap(nd_pfn->dev_map);
+
+	/* release nd_pfn resources */
+	kfree(nd_pfn->pfn_sb);
+	nd_pfn->pfn_sb = NULL;
+
+	return 0;
+}
+
+static noinline int nvdimm_namespace_attach_pfn(struct nd_namespace_common *ndns)
+{
+	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
+	struct nd_pfn *nd_pfn = to_nd_pfn(ndns->claim);
+	struct vmem_altmap *altmap;
+	struct nd_region *nd_region;
+	struct nd_pfn_sb *pfn_sb;
+	struct pmem_device *pmem;
+	struct dev_map *dev_map;
+	phys_addr_t offset;
+	int rc;
+	struct vmem_altmap __altmap = {
+		.base_pfn = __phys_to_pfn(nsio->res.start),
+		.reserve = __phys_to_pfn(SZ_8K),
+	};
+
+	if (!nd_pfn->uuid || !nd_pfn->ndns)
+		return -ENODEV;
+
+	nd_region = to_nd_region(nd_pfn->dev.parent);
+	rc = nd_pfn_init(nd_pfn);
+	if (rc)
+		return rc;
+
+	if (PAGE_SIZE != SZ_4K) {
+		dev_err(&nd_pfn->dev, "only supported on systems with 4K PAGE_SIZE\n");
+		return -ENXIO;
+	}
+	if (nsio->res.start & ND_PFN_MASK) {
+		dev_err(&nd_pfn->dev, "%s not memory hotplug section aligned\n",
+				dev_name(&ndns->dev));
+		return -ENXIO;
+	}
+
+	pfn_sb = nd_pfn->pfn_sb;
+	offset = le64_to_cpu(pfn_sb->dataoff);
+	if (nd_pfn->mode == PFN_MODE_PMEM) {
+		nd_pfn->npfns = (resource_size(&nsio->res) - offset)
+			/ PAGE_SIZE;
+		if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
+			dev_info(&nd_pfn->dev,
+					"number of pfns truncated from %lld to %ld\n",
+					le64_to_cpu(nd_pfn->pfn_sb->npfns),
+					nd_pfn->npfns);
+		altmap = &__altmap;
+		altmap->free = __phys_to_pfn(offset - SZ_8K);
+		altmap->alloc = 0;
+	} else if (nd_pfn->mode == PFN_MODE_RAM) {
+		if (offset != SZ_8K)
+			return -EINVAL;
+		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
+		altmap = NULL;
+	} else {
+		rc = -ENXIO;
+		goto err;
+	}
+
+	/* establish pfn range for lookup, and switch to direct map */
+	pmem = dev_get_drvdata(&nd_pfn->dev);
+	memunmap_pmem(&nd_pfn->dev, pmem->virt_addr);
+	dev_map = register_dev_memmap(&nd_pfn->dev, &nsio->res, altmap);
+	if (!dev_map) {
+		rc = -ENOMEM;
+		goto err;
+	}
+	nd_pfn->dev_map = dev_map;
+	pmem->virt_addr = (void __pmem *) __va(pmem->phys_addr);
+
+	/* attach pmem disk in "pfn-mode" */
+	pmem->data_offset = offset;
+	rc = pmem_attach_disk(&nd_pfn->dev, ndns, pmem);
+	if (rc)
+		goto err;
+
+	return rc;
+ err:
+	nvdimm_namespace_detach_pfn(ndns);
+	return rc;
+}
+
 static int nd_pmem_probe(struct device *dev)
 {
 	struct nd_region *nd_region = to_nd_region(dev->parent);
@@ -226,15 +411,25 @@ static int nd_pmem_probe(struct device *dev)
 	if (IS_ERR(pmem))
 		return PTR_ERR(pmem);
 
+	pmem->ndns = ndns;
 	dev_set_drvdata(dev, pmem);
 	ndns->rw_bytes = pmem_rw_bytes;
 
 	if (is_nd_btt(dev))
 		return nvdimm_namespace_attach_btt(ndns);
 
-	if (nd_btt_probe(ndns, pmem) == 0)
+	if (is_nd_pfn(dev))
+		return nvdimm_namespace_attach_pfn(ndns);
+
+	if (nd_btt_probe(ndns, pmem) == 0) {
 		/* we'll come back as btt-pmem */
 		return -ENXIO;
+	}
+
+	if (nd_pfn_probe(ndns, pmem) == 0) {
+		/* we'll come back as pfn-pmem */
+		return -ENXIO;
+	}
 
 	return pmem_attach_disk(dev, ndns, pmem);
 }
@@ -244,7 +439,9 @@ static int nd_pmem_remove(struct device *dev)
 	struct pmem_device *pmem = dev_get_drvdata(dev);
 
 	if (is_nd_btt(dev))
-		nvdimm_namespace_detach_btt(to_nd_btt(dev)->ndns);
+		nvdimm_namespace_detach_btt(pmem->ndns);
+	else if (is_nd_pfn(dev))
+		nvdimm_namespace_detach_pfn(pmem->ndns);
 	else
 		pmem_detach_disk(pmem);
 
diff --git a/tools/testing/nvdimm/Kbuild b/tools/testing/nvdimm/Kbuild
index 22d4d19a49bc..eff633f8b6db 100644
--- a/tools/testing/nvdimm/Kbuild
+++ b/tools/testing/nvdimm/Kbuild
@@ -1,6 +1,7 @@
 ldflags-y += --wrap=ioremap_wc
 ldflags-y += --wrap=devm_ioremap_nocache
 ldflags-y += --wrap=devm_memremap
+ldflags-y += --wrap=devm_memunmap
 ldflags-y += --wrap=ioremap_nocache
 ldflags-y += --wrap=iounmap
 ldflags-y += --wrap=__devm_request_region
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index ff1e00458864..3609f6713075 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -95,6 +95,20 @@ void *__wrap_devm_memremap(struct device *dev, resource_size_t offset,
 }
 EXPORT_SYMBOL(__wrap_devm_memremap);
 
+void __wrap_devm_memunmap(struct device *dev, void *addr)
+{
+	struct nfit_test_resource *nfit_res;
+
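+	/* test resources are backed by simulated buffers, nothing to unmap */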
+	rcu_read_lock();
+	nfit_res = get_nfit_res((unsigned long) addr);
+	rcu_read_unlock();
+	if (nfit_res)
+		return;
+	return devm_memunmap(dev, addr);
+}
+EXPORT_SYMBOL(__wrap_devm_memunmap);
+
 void __iomem *__wrap_ioremap_nocache(resource_size_t offset, unsigned long size)
 {
 	return __nfit_test_ioremap(offset, size, ioremap_nocache);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
@ 2015-08-14 21:37   ` Jerome Glisse
  2015-08-14 21:52     ` Dan Williams
  2015-08-15 13:33   ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2015-08-14 21:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, boaz, riel, linux-nvdimm, Dave Hansen, david,
	mingo, linux-mm, Ingo Molnar, mgorman, H. Peter Anvin,
	ross.zwisler, torvalds, hch

On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> While pmem is usable as a block device or via DAX mappings to userspace
> there are several usage scenarios that can not target pmem due to its
> lack of struct page coverage. In preparation for "hot plugging" pmem
> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
> separately from the ones that are subject to standard page allocations.
> Importantly "device memory" can be removed at will by userspace
> unbinding the driver of the device.
> 
> Having a separate zone prevents allocation and otherwise marks these
> pages that are distinct from typical uniform memory.  Device memory has
> different lifetime and performance characteristics than RAM.  However,
> since we have run out of ZONES_SHIFT bits this functionality currently
> depends on sacrificing ZONE_DMA.
> 
> arch_add_memory() is reorganized a bit in preparation for a new
> arch_add_dev_memory() api, for now there is no functional change to the
> memory hotplug code.
> 
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: linux-mm@kvack.org
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/Kconfig       |   13 +++++++++++++
>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
>  mm/memory_hotplug.c    |    5 ++++-
>  mm/page_alloc.c        |    3 +++
>  5 files changed, 64 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..64829b17980b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -308,6 +308,19 @@ config ZONE_DMA
>  
>  	  If unsure, say Y.
>  
> +config ZONE_DEVICE
> +	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
> +	default !ZONE_DMA
> +	depends on !ZONE_DMA
> +	help
> +	  Device memory hotplug support allows for establishing pmem,
> +	  or other device driver discovered memory regions, in the
> +	  memmap. This allows pfn_to_page() lookups of otherwise
> +	  "device-physical" addresses which is needed for using a DAX
> +	  mapping in an O_DIRECT operation, among other things.
> +
> +	  If FS_DAX is enabled, then say Y.
> +
>  config SMP
>  	bool "Symmetric multi-processing support"
>  	---help---
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..94f0fa56f0ed 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -683,15 +683,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
>  	}
>  }
>  
> -/*
> - * Memory is added always to NORMAL zone. This means you will never get
> - * additional DMA/DMA32 memory.
> - */
> -int arch_add_memory(int nid, u64 start, u64 size)
> +static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone)
>  {
> -	struct pglist_data *pgdat = NODE_DATA(nid);
> -	struct zone *zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
>  	WARN_ON_ONCE(ret);
>  
> -	/* update max_pfn, max_low_pfn and high_memory */
> -	update_end_of_memory_vars(start, size);
> +	/*
> +	 * Update max_pfn, max_low_pfn and high_memory, unless we added
> +	 * "device memory" which should not affect max_pfn
> +	 */
> +	if (!is_dev_zone(zone))
> +		update_end_of_memory_vars(start, size);

What is the rationale for not updating max_pfn, max_low_pfn, ... ?

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-14 21:37   ` Jerome Glisse
@ 2015-08-14 21:52     ` Dan Williams
  2015-08-14 22:06       ` Jerome Glisse
  2015-08-15  8:59       ` Christoph Hellwig
  0 siblings, 2 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-14 21:52 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
>> While pmem is usable as a block device or via DAX mappings to userspace
>> there are several usage scenarios that can not target pmem due to its
>> lack of struct page coverage. In preparation for "hot plugging" pmem
>> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
>> separately from the ones that are subject to standard page allocations.
>> Importantly "device memory" can be removed at will by userspace
>> unbinding the driver of the device.
>>
>> Having a separate zone prevents allocation and otherwise marks these
>> pages that are distinct from typical uniform memory.  Device memory has
>> different lifetime and performance characteristics than RAM.  However,
>> since we have run out of ZONES_SHIFT bits this functionality currently
>> depends on sacrificing ZONE_DMA.
>>
>> arch_add_memory() is reorganized a bit in preparation for a new
>> arch_add_dev_memory() api, for now there is no functional change to the
>> memory hotplug code.
>>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  arch/x86/Kconfig       |   13 +++++++++++++
>>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
>>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
>>  mm/memory_hotplug.c    |    5 ++++-
>>  mm/page_alloc.c        |    3 +++
>>  5 files changed, 64 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index b3a1a5d77d92..64829b17980b 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -308,6 +308,19 @@ config ZONE_DMA
>>
>>         If unsure, say Y.
>>
>> +config ZONE_DEVICE
>> +     bool "Device memory (pmem, etc...) hotplug support" if EXPERT
>> +     default !ZONE_DMA
>> +     depends on !ZONE_DMA
>> +     help
>> +       Device memory hotplug support allows for establishing pmem,
>> +       or other device driver discovered memory regions, in the
>> +       memmap. This allows pfn_to_page() lookups of otherwise
>> +       "device-physical" addresses which is needed for using a DAX
>> +       mapping in an O_DIRECT operation, among other things.
>> +
>> +       If FS_DAX is enabled, then say Y.
>> +
>>  config SMP
>>       bool "Symmetric multi-processing support"
>>       ---help---
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 3fba623e3ba5..94f0fa56f0ed 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
[..]
>> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>       ret = __add_pages(nid, zone, start_pfn, nr_pages);
>>       WARN_ON_ONCE(ret);
>>
>> -     /* update max_pfn, max_low_pfn and high_memory */
>> -     update_end_of_memory_vars(start, size);
>> +     /*
>> +      * Update max_pfn, max_low_pfn and high_memory, unless we added
>> +      * "device memory" which should not affect max_pfn
>> +      */
>> +     if (!is_dev_zone(zone))
>> +             update_end_of_memory_vars(start, size);
>
> What is the rationale for not updating max_pfn, max_low_pfn, ... ?
>

The idea is that this memory is not meant to be available to the page
allocator and should not count as new memory capacity.  We're only
hotplugging it to get struct page coverage.
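
For reference, is_dev_zone() can be a one-line zone-index test; a
minimal sketch, assuming ZONE_DEVICE is an ordinary entry in 'enum
zone_type' (the real helper is in the mmzone.h hunk of patch 1/7, not
shown in this excerpt):

static inline bool is_dev_zone(const struct zone *zone)
{
	/* device memory is tagged by its zone, not by a page flag */
	return zone_idx(zone) == ZONE_DEVICE;
}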

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-14 21:52     ` Dan Williams
@ 2015-08-14 22:06       ` Jerome Glisse
  2015-08-14 22:33         ` Dan Williams
  2015-08-15  8:59       ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2015-08-14 22:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> >> While pmem is usable as a block device or via DAX mappings to userspace
> >> there are several usage scenarios that can not target pmem due to its
> >> lack of struct page coverage. In preparation for "hot plugging" pmem
> >> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
> >> separately from the ones that are subject to standard page allocations.
> >> Importantly "device memory" can be removed at will by userspace
> >> unbinding the driver of the device.
> >>
> >> Having a separate zone prevents allocation and otherwise marks these
> >> pages that are distinct from typical uniform memory.  Device memory has
> >> different lifetime and performance characteristics than RAM.  However,
> >> since we have run out of ZONES_SHIFT bits this functionality currently
> >> depends on sacrificing ZONE_DMA.
> >>
> >> arch_add_memory() is reorganized a bit in preparation for a new
> >> arch_add_dev_memory() api, for now there is no functional change to the
> >> memory hotplug code.
> >>
> >> Cc: H. Peter Anvin <hpa@zytor.com>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Rik van Riel <riel@redhat.com>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Cc: linux-mm@kvack.org
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  arch/x86/Kconfig       |   13 +++++++++++++
> >>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
> >>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
> >>  mm/memory_hotplug.c    |    5 ++++-
> >>  mm/page_alloc.c        |    3 +++
> >>  5 files changed, 64 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index b3a1a5d77d92..64829b17980b 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -308,6 +308,19 @@ config ZONE_DMA
> >>
> >>         If unsure, say Y.
> >>
> >> +config ZONE_DEVICE
> >> +     bool "Device memory (pmem, etc...) hotplug support" if EXPERT
> >> +     default !ZONE_DMA
> >> +     depends on !ZONE_DMA
> >> +     help
> >> +       Device memory hotplug support allows for establishing pmem,
> >> +       or other device driver discovered memory regions, in the
> >> +       memmap. This allows pfn_to_page() lookups of otherwise
> >> +       "device-physical" addresses which is needed for using a DAX
> >> +       mapping in an O_DIRECT operation, among other things.
> >> +
> >> +       If FS_DAX is enabled, then say Y.
> >> +
> >>  config SMP
> >>       bool "Symmetric multi-processing support"
> >>       ---help---
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 3fba623e3ba5..94f0fa56f0ed 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> [..]
> >> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>       ret = __add_pages(nid, zone, start_pfn, nr_pages);
> >>       WARN_ON_ONCE(ret);
> >>
> >> -     /* update max_pfn, max_low_pfn and high_memory */
> >> -     update_end_of_memory_vars(start, size);
> >> +     /*
> >> +      * Update max_pfn, max_low_pfn and high_memory, unless we added
> >> +      * "device memory" which should not affect max_pfn
> >> +      */
> >> +     if (!is_dev_zone(zone))
> >> +             update_end_of_memory_vars(start, size);
> >
> > What is the rationale for not updating max_pfn, max_low_pfn, ... ?
> >
> 
> The idea is that this memory is not meant to be available to the page
> allocator and should not count as new memory capacity.  We're only
> hotplugging it to get struct page coverage.

But this sounds bogus to me, relying on max_pfn to stay smaller than
first_dev_pfn.  For instance you might plug a device that registers
dev memory, and then some regular memory might be hotplugged,
effectively updating max_pfn to a value bigger than first_dev_pfn.

Also I do not think that the buddy allocator uses max_pfn or max_low_pfn
to decide whether a page/zone is considered for allocation.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-14 22:06       ` Jerome Glisse
@ 2015-08-14 22:33         ` Dan Williams
  2015-08-15  2:11           ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-14 22:33 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
[..]
>> > What is the rationale for not updating max_pfn, max_low_pfn, ... ?
>> >
>>
>> The idea is that this memory is not meant to be available to the page
>> allocator and should not count as new memory capacity.  We're only
>> hotplugging it to get struct page coverage.
>
> But this sounds bogus to me, relying on max_pfn to stay smaller than
> first_dev_pfn.  For instance you might plug a device that registers
> dev memory, and then some regular memory might be hotplugged,
> effectively updating max_pfn to a value bigger than first_dev_pfn.
>

True.

> Also I do not think that the buddy allocator uses max_pfn or max_low_pfn
> to decide whether a page/zone is considered for allocation.

Yes, I took it out with no effects.  I'll investigate further whether
we should be touching those variables or not for this new usage.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-14 22:33         ` Dan Williams
@ 2015-08-15  2:11           ` Dan Williams
  2015-08-17 21:45             ` Jerome Glisse
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-15  2:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Fri, Aug 14, 2015 at 3:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> [..]
>>> > What is the rationale for not updating max_pfn, max_low_pfn, ... ?
>>> >
>>>
>>> The idea is that this memory is not meant to be available to the page
>>> allocator and should not count as new memory capacity.  We're only
>>> hotplugging it to get struct page coverage.
>>
>> But this sounds bogus to me, relying on max_pfn to stay smaller than
>> first_dev_pfn.  For instance you might plug a device that registers
>> dev memory, and then some regular memory might be hotplugged,
>> effectively updating max_pfn to a value bigger than first_dev_pfn.
>>
>
> True.
>
>> Also I do not think that the buddy allocator uses max_pfn or max_low_pfn
>> to decide whether a page/zone is considered for allocation.
>
> Yes, I took it out with no effects.  I'll investigate further whether
> we should be touching those variables or not for this new usage.

Although it does not offer perfect protection if device memory is at a
physically lower address than RAM, skipping the update of these
variables does seem to be what we want.  For example /dev/mem would
fail to allow write access to persistent memory if it fails a
valid_phys_addr_range() check.  Since /dev/mem does not know how to
write to PMEM in a reliably persistent way, it should not treat a
PMEM-pfn like RAM.
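
For context, the default /dev/mem range check in drivers/char/mem.c
boils down to a comparison against high_memory, which only advances
when update_end_of_memory_vars() runs; a sketch (the in-tree version
may differ in detail):

static inline int valid_phys_addr_range(phys_addr_t addr, size_t count)
{
	/* anything at or beyond high_memory is rejected as "not RAM" */
	return addr + count <= __pa(high_memory);
}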

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-14 21:52     ` Dan Williams
  2015-08-14 22:06       ` Jerome Glisse
@ 2015-08-15  8:59       ` Christoph Hellwig
  2015-08-21 15:02         ` Dan Williams
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15  8:59 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jerome Glisse, linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> The idea is that this memory is not meant to be available to the page
> allocator and should not count as new memory capacity.  We're only
> hotplugging it to get struct page coverage.

This might need a bigger audit of the max_pfn usages.  I remember
architectures using it as a decision point for using IOMMUs or similar.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 0/7] 'struct page' driver for persistent memory
  2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
                   ` (6 preceding siblings ...)
  2015-08-13  3:50 ` [RFC PATCH 7/7] libnvdimm, pmem: 'struct page' for pmem Dan Williams
@ 2015-08-15  9:01 ` Christoph Hellwig
  7 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15  9:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, boaz, riel, linux-nvdimm, Dave Hansen, david,
	mingo, linux-mm, Ingo Molnar, mgorman, H. Peter Anvin,
	ross.zwisler, torvalds, hch

Hi Dan,

based on the issues with the physical address S/G lists I suspect
we will need this page hotplugging code.  Any chance we could
sidestep the issue of storing the page structs on the actual pmem for
the first round so that we can get an initial version into 4.3?  I'll
help with the max_pfn audit and testing with my WIP block driver
that does I/O from pmem.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 4/7] mm: register_dev_memmap()
  2015-08-13  3:50 ` [RFC PATCH 4/7] mm: register_dev_memmap() Dan Williams
@ 2015-08-15  9:04   ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15  9:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, boaz, riel, linux-nvdimm, Dave Hansen, david,
	mingo, linux-mm, Ingo Molnar, mgorman, H. Peter Anvin,
	ross.zwisler, torvalds, hch

>  #endif /* _LINUX_KMAP_PFN_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8a4f24d7fdb0..07152a54b841 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -939,6 +939,7 @@ typedef struct {
>   * PFN_SG_CHAIN - pfn is a pointer to the next scatterlist entry
>   * PFN_SG_LAST - pfn references a page and is the last scatterlist entry
>   * PFN_DEV - pfn is not covered by system memmap
> + * PFN_MAP - pfn is covered by a device specific memmap
>   */
>  enum {
>  	PFN_MASK = (1UL << PAGE_SHIFT) - 1,
> @@ -949,6 +950,7 @@ enum {
>  #else
>  	PFN_DEV = 0,
>  #endif
> +	PFN_MAP = (1UL << 3),
>  };
>  
>  static inline __pfn_t pfn_to_pfn_t(unsigned long pfn, unsigned long flags)
> @@ -965,7 +967,7 @@ static inline __pfn_t phys_to_pfn_t(dma_addr_t addr, unsigned long flags)
>  
>  static inline bool __pfn_t_has_page(__pfn_t pfn)
>  {
> -	return (pfn.val & PFN_DEV) == 0;
> +	return (pfn.val & PFN_DEV) == 0 || (pfn.val & PFN_MAP) == PFN_MAP;

Shouldn't we simply not set the PFN_DEV flag instead of needing another
one to cancel it out?

I also wonder if it might be better to not require the __pfn_t and
SG rework patches before this series.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-13  3:50 ` [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option Dan Williams
@ 2015-08-15  9:06   ` Christoph Hellwig
  2015-08-15 15:28     ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15  9:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, boaz, riel, linux-nvdimm, david, mingo, linux-mm,
	mgorman, ross.zwisler, torvalds, hch

On Wed, Aug 12, 2015 at 11:50:29PM -0400, Dan Williams wrote:
> Purely for ease of testing, with this in place we can run the unit test
> alongside any tests that depend on the memmap=ss!nn kernel parameter.
> The unit test mocking implementation requires that libnvdimm be a module
> and not built-in.
> 
> A nice side effect is the implementation is a bit more generic as it no
> longer depends on <asm/e820.h>.

I really don't like this artificial split, and I also don't like how
your weird "unit tests" force even more ugliness on the kernel.  Almost
reminds me of the Python projects spending more effort on getting their
classes mockable than actually producing results.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
  2015-08-14 21:37   ` Jerome Glisse
@ 2015-08-15 13:33   ` Christoph Hellwig
  1 sibling, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15 13:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, boaz, riel, linux-nvdimm, Dave Hansen, david,
	mingo, linux-mm, Ingo Molnar, mgorman, H. Peter Anvin,
	ross.zwisler, torvalds, hch

On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> arch_add_memory() is reorganized a bit in preparation for a new
> arch_add_dev_memory() api, for now there is no functional change to the
> memory hotplug code.

Instead of the new arch_add_dev_memory call I'd just add a bool device
argument to arch_add_memory and zone_for_memory (and later the altmap
pointer as well).

arch_add_memory is a candidate to be factored into common code;
except for s390, everything could be done with two small arch callouts.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-15  9:06   ` Christoph Hellwig
@ 2015-08-15 15:28     ` Dan Williams
  2015-08-15 15:58       ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-15 15:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, david, Ingo Molnar, Linux MM,
	Mel Gorman, Ross Zwisler, torvalds

On Sat, Aug 15, 2015 at 2:06 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Aug 12, 2015 at 11:50:29PM -0400, Dan Williams wrote:
>> Purely for ease of testing, with this in place we can run the unit test
>> alongside any tests that depend on the memmap=ss!nn kernel parameter.
>> The unit test mocking implementation requires that libnvdimm be a module
>> and not built-in.
>>
>> A nice side effect is the implementation is a bit more generic as it no
>> longer depends on <asm/e820.h>.
>
> I really don't like this artificial split, and I also don't like how
> your weird "unit tests" force even more ugliness on the kernel.  Almost
> reminds me of the Python projects spending more effort on getting their
> classes mockable than actually producing results.

Well, the minute you see a 'struct DeviceFactory' appear in the kernel
source you can delete all the unit tests and come take away my
keyboard.  Until then, can we please push probing platform resources to
a device driver ->probe() method where it belongs?  Also, given the
type-7 vs. type-12 confusion I'm just waiting for some firmware to
describe persistent memory with type-12 at the e820 level and expect
an ACPI-NFIT to be able to sub-divide it.  In that case you'd want to
blacklist either 'nd_e820.ko' or 'nfit.ko' to resolve the conflict.
I'm not grokking the argument against allowing this functionality to
be modular.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-15 15:28     ` Dan Williams
@ 2015-08-15 15:58       ` Christoph Hellwig
  2015-08-15 16:04         ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-15 15:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, david, Ingo Molnar, Linux MM,
	Mel Gorman, Ross Zwisler, torvalds

On Sat, Aug 15, 2015 at 08:28:35AM -0700, Dan Williams wrote:
> I'm not grokking the argument against allowing this functionality to
> be modular.

You're adding another layer of platform_devices just to make a trivially
small piece of code modular so that you can hook into it.  I don't think
that's a good reason, and neither is the afterthought of preventing
potential future buggy firmware.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-15 15:58       ` Christoph Hellwig
@ 2015-08-15 16:04         ` Dan Williams
  2015-08-17 15:01           ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-15 16:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, david, Ingo Molnar, Linux MM,
	Mel Gorman, Ross Zwisler, torvalds

On Sat, Aug 15, 2015 at 8:58 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sat, Aug 15, 2015 at 08:28:35AM -0700, Dan Williams wrote:
>> I'm not grokking the argument against allowing this functionality to
>> be modular.
>
> You're adding another layer of platform_devices just to make a trivially
> small piece of code modular so that you can hook into it.  I don't think
> that's a good reason, and neither is the afterthought of preventing
> potential future buggy firmware.

What other layer? /sys/devices/platform/e820_pmem is the exact same
device we had before this patch.  We just have a proper driver for it
now.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-15 16:04         ` Dan Williams
@ 2015-08-17 15:01           ` Christoph Hellwig
  2015-08-17 15:32             ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2015-08-17 15:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, david, Ingo Molnar, Linux MM,
	Mel Gorman, Ross Zwisler, torvalds

On Sat, Aug 15, 2015 at 09:04:02AM -0700, Dan Williams wrote:
> What other layer? /sys/devices/platform/e820_pmem is the exact same
> device we had before this patch.  We just have a proper driver for it
> now.

We're adding another layer of indirection between the old e820 file
and the new module.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
  2015-08-17 15:01           ` Christoph Hellwig
@ 2015-08-17 15:32             ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-17 15:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, david, Ingo Molnar, Linux MM,
	Mel Gorman, Ross Zwisler, torvalds

On Mon, Aug 17, 2015 at 8:01 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sat, Aug 15, 2015 at 09:04:02AM -0700, Dan Williams wrote:
>> What other layer? /sys/devices/platform/e820_pmem is the exact same
>> device we had before this patch.  We just have a proper driver for it
>> now.
>
> We're adding another layer of indirection between the old e820 file
> and the new module.

Ok, yes, I was confused by "another layer of platform_devices".

That said, here are the non-unit-test-related reasons for this change
that I would include in a new changelog:

---

We currently register a platform device for e820 type-12 memory and
register an nvdimm bus beneath it.  Registering the platform device
triggers the device-core machinery to probe for a driver, but that
search currently comes up empty.  Building the nvdimm-bus registration
into the e820_pmem platform device registration in this way forces
libnvdimm to be built-in.  Instead, convert the built-in portion of
CONFIG_X86_PMEM_LEGACY to simply register a platform device and move
the rest of the logic to the driver for e820_pmem, for the following
reasons:

1/ Letting libnvdimm be a module allows building and testing
libnvdimm.ko changes without rebooting

2/ All the normal policy around modules can be applied to e820_pmem
(unbind to disable and/or blacklisting the module from loading by
default)

3/ Moving the driver to a generic location and converting it to scan
"iomem_resource" rather than "e820.map" means any other architecture
can take advantage of this simple nvdimm resource discovery mechanism
by registering a resource named "Persistent Memory (legacy)"
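
A rough sketch of the probe loop that 3/ describes; the descriptor
setup and error handling are abbreviated and illustrative, not the
exact driver:

static int e820_pmem_probe(struct platform_device *pdev)
{
	struct nvdimm_bus *nvdimm_bus = dev_get_drvdata(&pdev->dev);
	struct resource *p;

	for (p = iomem_resource.child; p; p = p->sibling) {
		struct nd_region_desc ndr_desc;

		/* only ranges the arch named as legacy persistent memory */
		if (strcmp(p->name, "Persistent Memory (legacy)") != 0)
			continue;
		memset(&ndr_desc, 0, sizeof(ndr_desc));
		ndr_desc.res = p;
		ndr_desc.numa_node = NUMA_NO_NODE;
		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
			return -ENXIO;
	}
	return 0;
}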

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-15  2:11           ` Dan Williams
@ 2015-08-17 21:45             ` Jerome Glisse
  2015-08-18  0:46               ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2015-08-17 21:45 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2336 bytes --]

On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> On Fri, Aug 14, 2015 at 3:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> >>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> > [..]
> >>> > What is the rationale for not updating max_pfn, max_low_pfn, ... ?
> >>> >
> >>>
> >>> The idea is that this memory is not meant to be available to the page
> >>> allocator and should not count as new memory capacity.  We're only
> >>> hotplugging it to get struct page coverage.
> >>
> >> But this sounds bogus to me, relying on max_pfn to stay smaller than
> >> first_dev_pfn.  For instance you might plug a device that registers
> >> dev memory, and then some regular memory might be hotplugged,
> >> effectively updating max_pfn to a value bigger than first_dev_pfn.
> >>
> >
> > True.
> >
> >> Also I do not think that the buddy allocator uses max_pfn or max_low_pfn
> >> to decide whether a page/zone is considered for allocation.
> >
> > Yes, I took it out with no effects.  I'll investigate further whether
> > we should be touching those variables or not for this new usage.
> 
> Although it does not offer perfect protection if device memory is at a
> physically lower address than RAM, skipping the update of these
> variables does seem to be what we want.  For example /dev/mem would
> fail to allow write access to persistent memory if it fails a
> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> write to PMEM in a reliably persistent way, it should not treat a
> PMEM-pfn like RAM.

So attached is a patch that should keep ZONE_DEVICE out of consideration
for the buddy allocator. You might also want to keep the pages reserved
and not freed inside the zone; you could replace generic_online_page()
using set_online_page_callback() while hotplugging device memory, as in
the sketch below.
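
Something along these lines (untested; this assumes the current
one-argument online_page_callback_t signature):

static void dev_memory_online_page(struct page *page)
{
	/* keep the page reserved instead of releasing it to the buddy */
	SetPageReserved(page);
}

	/* around the hotplug of the device range: */
	set_online_page_callback(dev_memory_online_page);
	/* ... arch_add_memory() for the device range, then online it ... */
	restore_online_page_callback(dev_memory_online_page);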

Regarding /dev/mem, I would not worry about highmem, as /dev/mem is already
broken with respect to memory holes that might exist (at least that is my
understanding). Alternatively, if you really care about /dev/mem, you could
add an arch valid_phys_addr_range() that checks for a valid zone.

Cheers,
Jérôme

[-- Attachment #2: 0001-mm-ZONE_DEVICE-Keep-ZONE_DEVICE-out-of-allocation-zo.patch --]
[-- Type: text/plain, Size: 1256 bytes --]

From 45976e1186eee45ecb277fe5293a7cfa7466d740 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Mon, 17 Aug 2015 17:31:27 -0400
Subject: [PATCH] mm/ZONE_DEVICE: Keep ZONE_DEVICE out of allocation zonelist.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Memory inside ZONE_DEVICE should never be considered by the buddy
allocator, and thus any such zone should never be added to any of
the zonelists.  This patch does just that.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22..f3e26de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3834,6 +3834,13 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
+		/*
+		 * The device zone is special memory and should never be
+		 * considered for regular allocation.  It is expected that
+		 * pages in the device zone will be allocated by other means.
+		 */
+		if (is_dev_zone(zone))
+			continue;
 		if (populated_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-17 21:45             ` Jerome Glisse
@ 2015-08-18  0:46               ` Dan Williams
  2015-08-18 16:55                 ` Jerome Glisse
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-18  0:46 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> Although it does not offer perfect protection if device memory is at a
>> physically lower address than RAM, skipping the update of these
>> variables does seem to be what we want.  For example /dev/mem would
>> fail to allow write access to persistent memory if it fails a
>> valid_phys_addr_range() check.  Since /dev/mem does not know how to
>> write to PMEM in a reliably persistent way, it should not treat a
>> PMEM-pfn like RAM.
>
> So i attach is a patch that should keep ZONE_DEVICE out of consideration
> for the buddy allocator. You might also want to keep page reserved and not
> free inside the zone, you could replace the generic_online_page() using
> set_online_page_callback() while hotpluging device memory.
>

Hmm, are we already protected by the fact that ZONE_DEVICE is not
represented in the GFP_ZONEMASK?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-18  0:46               ` Dan Williams
@ 2015-08-18 16:55                 ` Jerome Glisse
  2015-08-18 17:23                   ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2015-08-18 16:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> Although it does not offer perfect protection if device memory is at a
> >> physically lower address than RAM, skipping the update of these
> >> variables does seem to be what we want.  For example /dev/mem would
> >> fail to allow write access to persistent memory if it fails a
> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> >> write to PMEM in a reliably persistent way, it should not treat a
> >> PMEM-pfn like RAM.
> >
> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
> > for the buddy allocator. You might also want to keep page reserved and not
> > free inside the zone, you could replace the generic_online_page() using
> > set_online_page_callback() while hotpluging device memory.
> >
> 
> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> represented in the GFP_ZONEMASK?

Yeah, it seems you are right: high_zoneidx (which is derived using
gfp_zone()) will always limit which zones are considered. I thought that
under memory pressure it would go over all of the zonelist entries and
eventually consider the device zone, but it doesn't seem to be that way.

Keeping the device zone out of the zonelists might still be a good idea, if
only to avoid pointless iteration for the page allocator. Unless someone can
think of a reason why this would be bad.
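
For the record, the relevant definitions, paraphrased from
include/linux/gfp.h:

/* the only zone modifiers a caller can pass in a gfp mask */
#define GFP_ZONEMASK	(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)

/*
 * gfp_zone() maps these bits through GFP_ZONE_TABLE to a zone index,
 * so no gfp mask can ever select ZONE_DEVICE, and the allocator's
 * zonelist walk is always capped below it.
 */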

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-18 16:55                 ` Jerome Glisse
@ 2015-08-18 17:23                   ` Dan Williams
  2015-08-18 19:06                     ` Jerome Glisse
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-18 17:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
>> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> >> Although it does not offer perfect protection if device memory is at a
>> >> physically lower address than RAM, skipping the update of these
>> >> variables does seem to be what we want.  For example /dev/mem would
>> >> fail to allow write access to persistent memory if it fails a
>> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
>> >> write to PMEM in a reliably persistent way, it should not treat a
>> >> PMEM-pfn like RAM.
>> >
>> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
>> > for the buddy allocator. You might also want to keep page reserved and not
>> > free inside the zone, you could replace the generic_online_page() using
>> > set_online_page_callback() while hotpluging device memory.
>> >
>>
>> Hmm, are we already protected by the fact that ZONE_DEVICE is not
>> represented in the GFP_ZONEMASK?
>
> Yeah, it seems you are right: high_zoneidx (which is derived using
> gfp_zone()) will always limit which zones are considered. I thought that
> under memory pressure it would go over all of the zonelist entries and
> eventually consider the device zone, but it doesn't seem to be that way.
>
> Keeping the device zone out of the zonelists might still be a good idea, if
> only to avoid pointless iteration for the page allocator. Unless someone can
> think of a reason why this would be bad.
>

The other question I have is whether disabling ZONE_DMA is a realistic
tradeoff for enabling ZONE_DEVICE.  I.e., can ZONE_DMA default to off
going forward, losing some ISA device support, or do we need to figure
out how to enable more than 4 zones?
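
For reference, the limit in question, condensed from
include/linux/page-flags-layout.h (page->flags only reserves two bits
for the zone index today):

#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#else
#error ZONES_SHIFT -- too many zones configured adjust calculation
#endif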

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-18 17:23                   ` Dan Williams
@ 2015-08-18 19:06                     ` Jerome Glisse
  2015-08-20  0:49                       ` Dan Williams
  0 siblings, 1 reply; 31+ messages in thread
From: Jerome Glisse @ 2015-08-18 19:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> >> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> >> Although it does not offer perfect protection if device memory is at a
> >> >> physically lower address than RAM, skipping the update of these
> >> >> variables does seem to be what we want.  For example /dev/mem would
> >> >> fail to allow write access to persistent memory if it fails a
> >> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> >> >> write to PMEM in a reliably persistent way, it should not treat a
> >> >> PMEM-pfn like RAM.
> >> >
> >> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
> >> > for the buddy allocator. You might also want to keep page reserved and not
> >> > free inside the zone, you could replace the generic_online_page() using
> >> > set_online_page_callback() while hotpluging device memory.
> >> >
> >>
> >> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> >> represented in the GFP_ZONEMASK?
> >
> > Yeah, it seems you are right: high_zoneidx (which is derived using
> > gfp_zone()) will always limit which zones are considered. I thought that
> > under memory pressure it would go over all of the zonelist entries and
> > eventually consider the device zone, but it doesn't seem to be that way.
> >
> > Keeping the device zone out of the zonelists might still be a good idea, if
> > only to avoid pointless iteration for the page allocator. Unless someone can
> > think of a reason why this would be bad.
> >
> 
> The other question I have is whether disabling ZONE_DMA is a realistic
> tradeoff for enabling ZONE_DEVICE.  I.e., can ZONE_DMA default to off
> going forward, losing some ISA device support, or do we need to figure
> out how to enable more than 4 zones?

That requires some auditing; from a quick look it seems to matter for
the s390 arch, and there are still a few drivers that use it. I think
we can forget about the ISA bus; I would be surprised if you could
still run a recent kernel on a computer that has an ISA bus.

Though maybe you don't need a new ZONE_DEVICE at all; all you need is
valid struct pages for this device memory, with those pages unusable by
the general memory allocator. There are surely other ways to achieve
that, like marking them all as reserved when you hotplug them.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-18 19:06                     ` Jerome Glisse
@ 2015-08-20  0:49                       ` Dan Williams
  0 siblings, 0 replies; 31+ messages in thread
From: Dan Williams @ 2015-08-20  0:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, Christoph Hellwig

On Tue, Aug 18, 2015 at 12:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> Though maybe you don't need a new ZONE_DEVICE at all; all you need is
> valid struct pages for this device memory, with those pages unusable by
> the general memory allocator. There are surely other ways to achieve
> that, like marking them all as reserved when you hotplug them.
>

Yes, there are other ways to achieve the same thing, but I do like the
ability to do reverse page-to-zone lookups, for debug if nothing else.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-15  8:59       ` Christoph Hellwig
@ 2015-08-21 15:02         ` Dan Williams
  2015-08-21 15:15           ` Jerome Glisse
  0 siblings, 1 reply; 31+ messages in thread
From: Dan Williams @ 2015-08-21 15:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, David Woodhouse

[ Adding David Woodhouse ]

On Sat, Aug 15, 2015 at 1:59 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>> The idea is that this memory is not meant to be available to the page
>> allocator and should not count as new memory capacity.  We're only
>> hotplugging it to get struct page coverage.
>
> This might need a bigger audit of the max_pfn usages.  I remember
> architectures using it as a decision point for using IOMMUs or similar.

We chatted about this at LPC yesterday.  The takeaway was that the
max_pfn checks the IOMMU code does are for deciding whether a
device needs an io-virtual mapping to reach addresses above its DMA
limit (if it can't do 64-bit DMA).  Given the capacities of persistent
memory it's likely that a device with this limitation already can't
address all of RAM, let alone PMEM.  So it seems to me that updating
max_pfn for PMEM hotplug does not buy us anything except a few more
opportunities to confuse PMEM with typical RAM.
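
The kind of check in question, condensed from
arch/x86/kernel/pci-swiotlb.c (modulo config guards): the bounce-buffer
path only gets enabled when max_pfn says RAM extends past the 32-bit
DMA boundary:

int __init pci_swiotlb_detect_4gb(void)
{
	/* 32-bit DMA masters need swiotlb once memory crosses 4G */
	if (!no_iommu && max_pfn > MAX_DMA32_PFN)
		swiotlb = 1;
	return swiotlb;
}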

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory"
  2015-08-21 15:02         ` Dan Williams
@ 2015-08-21 15:15           ` Jerome Glisse
  0 siblings, 0 replies; 31+ messages in thread
From: Jerome Glisse @ 2015-08-21 15:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-kernel, Boaz Harrosh, Rik van Riel,
	linux-nvdimm@lists.01.org, Dave Hansen, david, Ingo Molnar,
	Linux MM, Ingo Molnar, Mel Gorman, H. Peter Anvin, Ross Zwisler,
	torvalds, David Woodhouse

On Fri, Aug 21, 2015 at 08:02:51AM -0700, Dan Williams wrote:
> [ Adding David Woodhouse ]
> 
> On Sat, Aug 15, 2015 at 1:59 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> >> The idea is that this memory is not meant to be available to the page
> >> allocator and should not count as new memory capacity.  We're only
> >> hotplugging it to get struct page coverage.
> >
> > This might need a bigger audit of the max_pfn usages.  I remember
> > architectures using it as a decision point for using IOMMUs or similar.
> 
> We chatted about this at LPC yesterday.  The takeaway was that the
> max_pfn checks the IOMMU code does are for deciding whether a
> device needs an io-virtual mapping to reach addresses above its DMA
> limit (if it can't do 64-bit DMA).  Given the capacities of persistent
> memory it's likely that a device with this limitation already can't
> address all of RAM, let alone PMEM.  So it seems to me that updating
> max_pfn for PMEM hotplug does not buy us anything except a few more
> opportunities to confuse PMEM with typical RAM.

I think it is wrong not to update max_pfn, for 3 reasons:
  - In some cases your PMEM will end up below the current max_pfn
    value, so a device doing DMA can confuse your PMEM for regular RAM.
  - Given the above, not updating max_pfn means you are not consistent,
    i.e. on some computers PMEM will be DMA-addressable by a device and
    on others it will not, because different RAM and PMEM
    configurations, different BIOSes, etc. can cause max_pfn to be
    below the range where PMEM gets hotplugged.
  - Last, why would we want to block a device from accessing PMEM
    directly?  Wouldn't it make sense for some device, say a network
    card, to read PMEM directly and stream it over the network?  All
    this would happen through the IOMMU (I am assuming a PCIe network
    card using an IOMMU), which implies having the IOMMU treat this
    like a regular mapping (ignoring Will Davis' recent patchset here,
    which might affect the IOMMU max_pfn test).

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2015-08-21 15:15 UTC | newest]

Thread overview: 31+ messages
2015-08-13  3:50 [RFC PATCH 0/7] 'struct page' driver for persistent memory Dan Williams
2015-08-13  3:50 ` [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" Dan Williams
2015-08-14 21:37   ` Jerome Glisse
2015-08-14 21:52     ` Dan Williams
2015-08-14 22:06       ` Jerome Glisse
2015-08-14 22:33         ` Dan Williams
2015-08-15  2:11           ` Dan Williams
2015-08-17 21:45             ` Jerome Glisse
2015-08-18  0:46               ` Dan Williams
2015-08-18 16:55                 ` Jerome Glisse
2015-08-18 17:23                   ` Dan Williams
2015-08-18 19:06                     ` Jerome Glisse
2015-08-20  0:49                       ` Dan Williams
2015-08-15  8:59       ` Christoph Hellwig
2015-08-21 15:02         ` Dan Williams
2015-08-21 15:15           ` Jerome Glisse
2015-08-15 13:33   ` Christoph Hellwig
2015-08-13  3:50 ` [RFC PATCH 2/7] x86, mm: introduce struct vmem_altmap Dan Williams
2015-08-13  3:50 ` [RFC PATCH 3/7] x86, mm: arch_add_dev_memory() Dan Williams
2015-08-13  3:50 ` [RFC PATCH 4/7] mm: register_dev_memmap() Dan Williams
2015-08-15  9:04   ` Christoph Hellwig
2015-08-13  3:50 ` [RFC PATCH 5/7] libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option Dan Williams
2015-08-15  9:06   ` Christoph Hellwig
2015-08-15 15:28     ` Dan Williams
2015-08-15 15:58       ` Christoph Hellwig
2015-08-15 16:04         ` Dan Williams
2015-08-17 15:01           ` Christoph Hellwig
2015-08-17 15:32             ` Dan Williams
2015-08-13  3:50 ` [RFC PATCH 6/7] libnvdimm, pfn: 'struct page' provider infrastructure Dan Williams
2015-08-13  3:50 ` [RFC PATCH 7/7] libnvdimm, pmem: 'struct page' for pmem Dan Williams
2015-08-15  9:01 ` [RFC PATCH 0/7] 'struct page' driver for persistent memory Christoph Hellwig
