* [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
@ 2018-05-08  2:30 ` Huaisheng Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng Ye @ 2018-05-08  2:30 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, willy, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, chengnt, hehy1, linux-kernel,
	linux-nvdimm, Huaisheng Ye

Traditionally, the mm (memory management) subsystem treats NVDIMMs as
part of the DEVICE zone, a virtual zone whose start and end pfn are both
0. mm does not manage NVDIMM directly the way it manages DRAM. Instead,
the kernel relies on the corresponding drivers, which live in
drivers/nvdimm/, drivers/acpi/nfit/ and fs, to allocate and free NVDIMM
memory through the memory hotplug implementation.

With the current kernel, many of mm's classical features, such as the
buddy system, the swap mechanism and the page cache, cannot be applied
to NVDIMM. What we are doing is extending mm so that it can handle
NVDIMM like DRAM. Furthermore, mm can treat DRAM and NVDIMM separately,
which means mm can place only the critical pages in the NVDIMM zone. For
this we created a new zone type, the NVM zone. That is to say,
traditional (or normal) pages are stored in the DRAM zones, i.e. Normal,
DMA32 and DMA. Critical pages, which we hope can be recovered after a
power failure or system crash, are made persistent by storing them in
the NVM zone.

We installed two NVDIMMs in a Lenovo Thinksystem product as the
development platform; each has a storage capacity of 125GB. With the
patches below, mm can create NVM zones for the NVDIMMs.

Here is the dmesg output:
 Initmem setup node 0 [mem 0x0000000000001000-0x000000237fffffff]
 On node 0 totalpages: 36879666
   DMA zone: 64 pages used for memmap
   DMA zone: 23 pages reserved
   DMA zone: 3999 pages, LIFO batch:0
 mminit::memmap_init Initialising map node 0 zone 0 pfns 1 -> 4096
   DMA32 zone: 10935 pages used for memmap
   DMA32 zone: 699795 pages, LIFO batch:31
 mminit::memmap_init Initialising map node 0 zone 1 pfns 4096 -> 1048576
   Normal zone: 53248 pages used for memmap
   Normal zone: 3407872 pages, LIFO batch:31
 mminit::memmap_init Initialising map node 0 zone 2 pfns 1048576 -> 4456448
   NVM zone: 512000 pages used for memmap
   NVM zone: 32768000 pages, LIFO batch:31
 mminit::memmap_init Initialising map node 0 zone 3 pfns 4456448 -> 37224448
 Initmem setup node 1 [mem 0x0000002380000000-0x00000046bfffffff]
 On node 1 totalpages: 36962304
   Normal zone: 65536 pages used for memmap
   Normal zone: 4194304 pages, LIFO batch:31
 mminit::memmap_init Initialising map node 1 zone 2 pfns 37224448 -> 41418752
   NVM zone: 512000 pages used for memmap
   NVM zone: 32768000 pages, LIFO batch:31
 mminit::memmap_init Initialising map node 1 zone 3 pfns 41418752 -> 74186752

This is from /proc/zoneinfo:
Node 0, zone      NVM
  pages free     32768000
        min      15244
        low      48012
        high     80780
        spanned  32768000
        present  32768000
        managed  32768000
        protection: (0, 0, 0, 0, 0, 0)
        nr_free_pages 32768000
Node 1, zone      NVM
  pages free     32768000
        min      15244
        low      48012
        high     80780
        spanned  32768000
        present  32768000
        managed  32768000


Huaisheng Ye (6):
  mm/memblock: Expand definition of flags to support NVDIMM
  mm/page_alloc.c: get pfn range with flags of memblock
  mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  arch/x86/kernel: mark NVDIMM regions from e820_table
  mm: get zone spanned pages separately for DRAM and NVDIMM
  arch/x86/mm: create page table mapping for DRAM and NVDIMM both

 arch/x86/include/asm/e820/api.h |  3 +++
 arch/x86/kernel/e820.c          | 20 +++++++++++++-
 arch/x86/kernel/setup.c         |  8 ++++++
 arch/x86/mm/init_64.c           | 16 +++++++++++
 include/linux/gfp.h             | 57 ++++++++++++++++++++++++++++++++++++---
 include/linux/memblock.h        | 19 +++++++++++++
 include/linux/mm.h              |  4 +++
 include/linux/mmzone.h          |  3 +++
 mm/Kconfig                      | 16 +++++++++++
 mm/memblock.c                   | 46 +++++++++++++++++++++++++++----
 mm/nobootmem.c                  |  5 ++--
 mm/page_alloc.c                 | 60 ++++++++++++++++++++++++++++++++++++++++-
 12 files changed, 245 insertions(+), 12 deletions(-)

-- 
1.8.3.1

* RE: [External] [RFC PATCH v1 1/6] mm/memblock: Expand definition of flags to support NVDIMM
       [not found] ` <1525746628-114136-2-git-send-email-yehs1@lenovo.com>
@ 2018-05-08  2:30     ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:30 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

This patch gives mm the ability to get special regions from memblock.

During boot, memblock marks NVDIMM regions with the flag
MEMBLOCK_NVDIMM. The patch also extends the related functions and
macros so that they take flags.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 include/linux/memblock.h | 19 +++++++++++++++++++
 mm/memblock.c            | 46 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f92ea77..cade5c8d 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -26,6 +26,8 @@ enum {
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
+	MEMBLOCK_NVDIMM		= 0x8,	/* NVDIMM region */
+	MEMBLOCK_MAX_TYPE	= 0x10	/* all regions */
 };
 
 struct memblock_region {
@@ -89,6 +91,8 @@ bool memblock_overlaps_region(struct memblock_type *type,
 int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
 int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_nvdimm(phys_addr_t base, phys_addr_t size);
+int memblock_clear_nvdimm(phys_addr_t base, phys_addr_t size);
 ulong choose_memblock_flags(void);
 
 /* Low level functions */
@@ -167,6 +171,11 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
 	     i != (u64)ULLONG_MAX;					\
 	     __next_reserved_mem_region(&i, p_start, p_end))
 
+static inline bool memblock_is_nvdimm(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_NVDIMM;
+}
+
 static inline bool memblock_is_hotpluggable(struct memblock_region *m)
 {
 	return m->flags & MEMBLOCK_HOTPLUG;
@@ -187,6 +196,11 @@ int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 			  unsigned long *out_end_pfn, int *out_nid);
+void __next_mem_pfn_range_with_flags(int *idx, int nid,
+				     unsigned long *out_start_pfn,
+				     unsigned long *out_end_pfn,
+				     int *out_nid,
+				     unsigned long flags);
 
 /**
  * for_each_mem_pfn_range - early memory pfn range iterator
@@ -201,6 +215,11 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
 #define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)		\
 	for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \
 	     i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
+
+#define for_each_mem_pfn_range_with_flags(i, nid, p_start, p_end, p_nid, flags) \
+	for (i = -1, __next_mem_pfn_range_with_flags(&i, nid, p_start, p_end, p_nid, flags);\
+	     i >= 0; __next_mem_pfn_range_with_flags(&i, nid, p_start, p_end, p_nid, flags))
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 /**
diff --git a/mm/memblock.c b/mm/memblock.c
index 48376bd..7699637 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -771,6 +771,16 @@ int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
 	return memblock_setclr_flag(base, size, 0, MEMBLOCK_HOTPLUG);
 }
 
+int __init_memblock memblock_mark_nvdimm(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 1, MEMBLOCK_NVDIMM);
+}
+
+int __init_memblock memblock_clear_nvdimm(phys_addr_t base, phys_addr_t size)
+{
+	return memblock_setclr_flag(base, size, 0, MEMBLOCK_NVDIMM);
+}
+
 /**
  * memblock_mark_mirror - Mark mirrored memory with flag MEMBLOCK_MIRROR.
  * @base: the base phys addr of the region
@@ -891,6 +901,10 @@ void __init_memblock __next_mem_range(u64 *idx, int nid, ulong flags,
 		if (nid != NUMA_NO_NODE && nid != m_nid)
 			continue;
 
+		/* skip nvdimm memory regions if needed */
+		if (!(flags & MEMBLOCK_NVDIMM) && memblock_is_nvdimm(m))
+			continue;
+
 		/* skip hotpluggable memory regions if needed */
 		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
 			continue;
@@ -1007,6 +1021,10 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, ulong flags,
 		if (nid != NUMA_NO_NODE && nid != m_nid)
 			continue;
 
+		/* skip nvdimm memory regions if needed */
+		if (!(flags & MEMBLOCK_NVDIMM) && memblock_is_nvdimm(m))
+			continue;
+
 		/* skip hotpluggable memory regions if needed */
 		if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
 			continue;
@@ -1070,12 +1088,9 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, ulong flags,
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-/*
- * Common iterator interface used to define for_each_mem_range().
- */
-void __init_memblock __next_mem_pfn_range(int *idx, int nid,
+void __init_memblock __next_mem_pfn_range_with_flags(int *idx, int nid,
 				unsigned long *out_start_pfn,
-				unsigned long *out_end_pfn, int *out_nid)
+				unsigned long *out_end_pfn, int *out_nid, unsigned long flags)
 {
 	struct memblock_type *type = &memblock.memory;
 	struct memblock_region *r;
@@ -1085,6 +1100,16 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 
 		if (PFN_UP(r->base) >= PFN_DOWN(r->base + r->size))
 			continue;
+
+		/*
+		 *  Use "flags & r->flags " to find region with multi-flags
+		 *  Use "flags == r->flags" to include region flags of MEMBLOCK_NONE
+		 *  Set flags = MEMBLOCK_MAX_TYPE to ignore to check flags
+		 */
+
+		if ((flags != MEMBLOCK_MAX_TYPE) && (flags != r->flags) && !(flags & r->flags))
+			continue;
+
 		if (nid == MAX_NUMNODES || nid == r->nid)
 			break;
 	}
@@ -1101,6 +1126,17 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
 		*out_nid = r->nid;
 }
 
+/*
+ * Common iterator interface used to define for_each_mem_range().
+ */
+void __init_memblock __next_mem_pfn_range(int *idx, int nid,
+				unsigned long *out_start_pfn,
+				unsigned long *out_end_pfn, int *out_nid)
+{
+	__next_mem_pfn_range_with_flags(idx, nid, out_start_pfn, out_end_pfn,
+						out_nid, MEMBLOCK_MAX_TYPE);
+}
+
 /**
  * memblock_set_node - set node ID on memblock regions
  * @base: base of area to set node ID for
-- 
1.8.3.1

* [RFC PATCH v1 4/6] arch/x86/kernel: mark NVDIMM regions from e820_table
  2018-05-08  2:30 ` Huaisheng Ye
  (?)
  (?)
@ 2018-05-08  2:30 ` Huaisheng Ye
  -1 siblings, 0 replies; 31+ messages in thread
From: Huaisheng Ye @ 2018-05-08  2:30 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, willy, vbabka, mgorman, pasha.tatashin, alexander.levin,
	hannes, penguin-kernel, colyli, chengnt, hehy1, linux-kernel,
	linux-nvdimm, Huaisheng Ye

During e820__memblock_setup, memblock takes the entries of type
E820_TYPE_RAM, E820_TYPE_RESERVED_KERN and E820_TYPE_PMEM from
e820_table and marks the NVDIMM regions with the flag MEMBLOCK_NVDIMM.

Add a function, e820__end_of_nvm_pfn, to calculate max_pfn including
the NVDIMM regions, because zone_sizes_init needs max_pfn to derive
arch_zone_lowest/highest_possible_pfn. During free_area_init_nodes,
the possible pfns need to be recalculated for ZONE_NVM.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 arch/x86/include/asm/e820/api.h |  3 +++
 arch/x86/kernel/e820.c          | 20 +++++++++++++++++++-
 arch/x86/kernel/setup.c         |  8 ++++++++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/e820/api.h b/arch/x86/include/asm/e820/api.h
index 62be73b..b8006c3 100644
--- a/arch/x86/include/asm/e820/api.h
+++ b/arch/x86/include/asm/e820/api.h
@@ -22,6 +22,9 @@
 extern void e820__update_table_print(void);
 
 extern unsigned long e820__end_of_ram_pfn(void);
+#ifdef CONFIG_ZONE_NVM
+extern unsigned long e820__end_of_nvm_pfn(void);
+#endif
 extern unsigned long e820__end_of_low_ram_pfn(void);
 
 extern u64  e820__memblock_alloc_reserved(u64 size, u64 align);
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 71c11ad..c1dc1cc 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -840,6 +840,13 @@ unsigned long __init e820__end_of_ram_pfn(void)
 	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_RAM);
 }
 
+#ifdef CONFIG_ZONE_NVM
+unsigned long __init e820__end_of_nvm_pfn(void)
+{
+	return e820_end_pfn(MAX_ARCH_PFN, E820_TYPE_PMEM);
+}
+#endif
+
 unsigned long __init e820__end_of_low_ram_pfn(void)
 {
 	return e820_end_pfn(1UL << (32 - PAGE_SHIFT), E820_TYPE_RAM);
@@ -1246,11 +1253,22 @@ void __init e820__memblock_setup(void)
 		end = entry->addr + entry->size;
 		if (end != (resource_size_t)end)
 			continue;
-
+#ifdef CONFIG_ZONE_NVM
+		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN &&
+								entry->type != E820_TYPE_PMEM)
+#else
 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
+#endif
 			continue;
 
 		memblock_add(entry->addr, entry->size);
+
+#ifdef CONFIG_ZONE_NVM
+		if (entry->type == E820_TYPE_PMEM) {
+			/* Mark this region with PMEM flags */
+			memblock_mark_nvdimm(entry->addr, entry->size);
+		}
+#endif
 	}
 
 	/* Throw away partial pages: */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4c616be..84c4ddb 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1032,7 +1032,15 @@ void __init setup_arch(char **cmdline_p)
 	 * partially used pages are not usable - thus
 	 * we are rounding upwards:
 	 */
+#ifdef CONFIG_ZONE_NVM
+	max_pfn = e820__end_of_nvm_pfn();
+	if (!max_pfn) {
+		printk(KERN_INFO "No physical NVDIMM has been found\n");
+		max_pfn = e820__end_of_ram_pfn();
+	}
+#else
 	max_pfn = e820__end_of_ram_pfn();
+#endif
 
 	/* update e820 for memory not covered by WB MTRRs */
 	mtrr_bp_init();
-- 
1.8.3.1


* [External]  [RFC PATCH v1 2/6] mm/page_alloc.c: get pfn range with flags of memblock
       [not found] ` <1525746628-114136-3-git-send-email-yehs1@lenovo.com>
@ 2018-05-08  2:32     ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:32 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

This extends the interface of get_pfn_range_for_nid with memblock
flags, so that mm can get the pfn range of regions with specific flags.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 include/linux/mm.h |  4 ++++
 mm/page_alloc.c    | 17 ++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..8abf9c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2046,6 +2046,10 @@ extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
 			unsigned long *start_pfn, unsigned long *end_pfn);
+extern void get_pfn_range_for_nid_with_flags(unsigned int nid,
+					     unsigned long *start_pfn,
+					     unsigned long *end_pfn,
+					     unsigned long flags);
 extern unsigned long find_min_pfn_with_active_regions(void);
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1741dd2..266c065 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5705,13 +5705,28 @@ void __init sparse_memory_present_with_active_regions(int nid)
 void __meminit get_pfn_range_for_nid(unsigned int nid,
 			unsigned long *start_pfn, unsigned long *end_pfn)
 {
+	get_pfn_range_for_nid_with_flags(nid, start_pfn, end_pfn,
+					 MEMBLOCK_MAX_TYPE);
+}
+
+/*
+ * If MAX_NUMNODES, includes all node memory regions.
+ * If MEMBLOCK_MAX_TYPE, includes all memory regions with or without Flags.
+ */
+
+void __meminit get_pfn_range_for_nid_with_flags(unsigned int nid,
+						unsigned long *start_pfn,
+						unsigned long *end_pfn,
+						unsigned long flags)
+{
 	unsigned long this_start_pfn, this_end_pfn;
 	int i;
 
 	*start_pfn = -1UL;
 	*end_pfn = 0;
 
-	for_each_mem_pfn_range(i, nid, &this_start_pfn, &this_end_pfn, NULL) {
+	for_each_mem_pfn_range_with_flags(i, nid, &this_start_pfn,
+					  &this_end_pfn, NULL, flags) {
 		*start_pfn = min(*start_pfn, this_start_pfn);
 		*end_pfn = max(*end_pfn, this_end_pfn);
 	}
-- 
1.8.3.1

* [External]  [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
       [not found] ` <1525746628-114136-4-git-send-email-yehs1@lenovo.com>
@ 2018-05-08  2:33     ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:33 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

Add ZONE_NVM to enum zone_type, and create GFP_NVM, the gfp_t flag
that selects the NVM zone.

Because no free low-order GFP bit is available for ___GFP_NVM, the
workaround is to take an entry away from GFP_ZONE_BAD (the DMA+HIGHMEM
combination, 0x3) and use it to place ZONE_NVM in GFP_ZONE_TABLE.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 include/linux/gfp.h    | 57 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/mmzone.h |  3 +++
 mm/Kconfig             | 16 ++++++++++++++
 mm/page_alloc.c        |  3 +++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..9e4d867 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,9 @@
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_ZONE_NVM
+#define ___GFP_NVM		0x4000000u
+#endif
 #ifdef CONFIG_LOCKDEP
 #define ___GFP_NOLOCKDEP	0x2000000u
 #else
@@ -57,7 +60,12 @@
 #define __GFP_HIGHMEM	((__force gfp_t)___GFP_HIGHMEM)
 #define __GFP_DMA32	((__force gfp_t)___GFP_DMA32)
 #define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
+#ifdef CONFIG_ZONE_NVM
+#define __GFP_NVM	((__force gfp_t)___GFP_NVM)  /* ZONE_NVM allowed */
+#define GFP_ZONEMASK	(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE|__GFP_NVM)
+#else
 #define GFP_ZONEMASK	(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+#endif
 
 /*
  * Page mobility and placement hints
@@ -205,7 +213,8 @@
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP) + \
+				(IS_ENABLED(CONFIG_ZONE_NVM) << 1))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
@@ -283,6 +292,9 @@
 #define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
+#ifdef CONFIG_ZONE_NVM
+#define GFP_NVM		__GFP_NVM
+#endif
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -342,7 +354,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  *       0x0    => NORMAL
  *       0x1    => DMA or NORMAL
  *       0x2    => HIGHMEM or NORMAL
- *       0x3    => BAD (DMA+HIGHMEM)
+ *       0x3    => NVM (the DMA+HIGHMEM slot, reused for the NVDIMM zone)
  *       0x4    => DMA32 or DMA or NORMAL
  *       0x5    => BAD (DMA+DMA32)
  *       0x6    => BAD (HIGHMEM+DMA32)
@@ -370,6 +382,29 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 #error GFP_ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
 #endif
 
+#ifdef CONFIG_ZONE_NVM
+#define ___GFP_NVM_BIT (___GFP_DMA | ___GFP_HIGHMEM)
+#define GFP_ZONE_TABLE ( \
+	((__force unsigned long)ZONE_NORMAL <<				       \
+			0 * GFP_ZONES_SHIFT)				       \
+	| ((__force unsigned long)OPT_ZONE_DMA <<			       \
+			___GFP_DMA * GFP_ZONES_SHIFT)			       \
+	| ((__force unsigned long)OPT_ZONE_HIGHMEM <<			       \
+			___GFP_HIGHMEM * GFP_ZONES_SHIFT)		       \
+	| ((__force unsigned long)OPT_ZONE_DMA32 <<			       \
+			___GFP_DMA32 * GFP_ZONES_SHIFT)			       \
+	| ((__force unsigned long)ZONE_NORMAL <<			       \
+			___GFP_MOVABLE * GFP_ZONES_SHIFT)		       \
+	| ((__force unsigned long)OPT_ZONE_DMA <<			       \
+			(___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)       \
+	| ((__force unsigned long)ZONE_MOVABLE <<			       \
+			(___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)   \
+	| ((__force unsigned long)OPT_ZONE_DMA32 <<			       \
+			(___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)     \
+	| ((__force unsigned long)ZONE_NVM <<				       \
+			___GFP_NVM_BIT * GFP_ZONES_SHIFT)                      \
+)
+#else
 #define GFP_ZONE_TABLE ( \
 	(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)				       \
 	| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)		       \
@@ -380,6 +415,7 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
 	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
 )
+#endif
 
 /*
  * GFP_ZONE_BAD is a bitmap for all combinations of __GFP_DMA, __GFP_DMA32
@@ -387,6 +423,17 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
  * entry starting with bit 0. Bit is set if the combination is not
  * allowed.
  */
+#ifdef CONFIG_ZONE_NVM
+#define GFP_ZONE_BAD ( \
+	1 << (___GFP_DMA | ___GFP_DMA32)				      \
+	| 1 << (___GFP_DMA32 | ___GFP_HIGHMEM)				      \
+	| 1 << (___GFP_DMA | ___GFP_DMA32 | ___GFP_HIGHMEM)		      \
+	| 1 << (___GFP_MOVABLE | ___GFP_HIGHMEM | ___GFP_DMA)		      \
+	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA)		      \
+	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM)		      \
+	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM)  \
+)
+#else
 #define GFP_ZONE_BAD ( \
 	1 << (___GFP_DMA | ___GFP_HIGHMEM)				      \
 	| 1 << (___GFP_DMA | ___GFP_DMA32)				      \
@@ -397,12 +444,16 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_HIGHMEM)		      \
 	| 1 << (___GFP_MOVABLE | ___GFP_DMA32 | ___GFP_DMA | ___GFP_HIGHMEM)  \
 )
+#endif
 
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
 	enum zone_type z;
 	int bit = (__force int) (flags & GFP_ZONEMASK);
-
+#ifdef CONFIG_ZONE_NVM
+	if (bit & ___GFP_NVM)
+		bit = (__force int)___GFP_NVM_BIT;
+#endif
 	z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
 					 ((1 << GFP_ZONES_SHIFT) - 1);
 	VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a69..f38e4a0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -345,6 +345,9 @@ enum zone_type {
 	 */
 	ZONE_HIGHMEM,
 #endif
+#ifdef CONFIG_ZONE_NVM
+	ZONE_NVM,
+#endif
 	ZONE_MOVABLE,
 #ifdef CONFIG_ZONE_DEVICE
 	ZONE_DEVICE,
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8f..5fe1f63 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -687,6 +687,22 @@ config ZONE_DEVICE
 
 	  If FS_DAX is enabled, then say Y.
 
+config ZONE_NVM
+	bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
+	depends on NUMA && X86_64
+	depends on HAVE_MEMBLOCK_NODE_MAP
+	depends on HAVE_MEMBLOCK
+	depends on !IA32_EMULATION
+	default n
+
+	help
+	  This option lets the memory management subsystem manage NVDIMM
+	  (pmem) directly. With it, mm can arrange NVDIMMs into real physical
+	  zones like NORMAL and DMA32, so the buddy system and swap can be
+	  used directly on the NVDIMM zone. This helps recover dirty pages
+	  after a power failure or system crash by storing the write cache
+	  in the NVDIMM zone.
+
 config ARCH_HAS_HMM
 	bool
 	default y
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 266c065..d8bd20d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -228,6 +228,9 @@ bool pm_suspended_storage(void)
 	 "DMA32",
 #endif
 	 "Normal",
+#ifdef CONFIG_ZONE_NVM
+	 "NVM",
+#endif
 #ifdef CONFIG_HIGHMEM
 	 "HighMem",
 #endif
-- 
1.8.3.1

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 31+ messages in thread
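
The table lookup that this patch changes can be illustrated with a small
userspace model. It is a sketch under stated assumptions rather than the
kernel code: the Z_* zones, F_* flag values (other than the 0x4000000
NVM bit, which is taken from the patch) and the 3-bit ZONES_SHIFT are
hypothetical and model an x86_64 build without HIGHMEM. It shows the
high NVM bit being folded onto table slot 0x3 before the packed table is
indexed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zone numbering for a build without HIGHMEM. */
enum zone { Z_DMA, Z_DMA32, Z_NORMAL, Z_NVM, Z_MOVABLE };

#define ZONES_SHIFT	3		/* bits per table entry (assumed) */

#define F_DMA		0x01u
#define F_HIGHMEM	0x02u
#define F_DMA32		0x04u
#define F_MOVABLE	0x08u
#define F_NVM		0x4000000u	/* high bit, as in the patch */
#define NVM_SLOT	(F_DMA | F_HIGHMEM)	/* slot 0x3, formerly "bad" */

#define ENTRY(zone, slot) ((uint64_t)(zone) << ((slot) * ZONES_SHIFT))

/* Packed table: one ZONES_SHIFT-wide entry per low-bit combination. */
static const uint64_t zone_table =
	  ENTRY(Z_NORMAL,  0)
	| ENTRY(Z_DMA,     F_DMA)
	| ENTRY(Z_NORMAL,  F_HIGHMEM)	/* no HIGHMEM: falls back to NORMAL */
	| ENTRY(Z_DMA32,   F_DMA32)
	| ENTRY(Z_NORMAL,  F_MOVABLE)
	| ENTRY(Z_DMA,     F_MOVABLE | F_DMA)
	| ENTRY(Z_MOVABLE, F_MOVABLE | F_HIGHMEM)
	| ENTRY(Z_DMA32,   F_MOVABLE | F_DMA32)
	| ENTRY(Z_NVM,     NVM_SLOT);

/* One bit per forbidden combination; slot 0x3 is no longer listed. */
static const uint32_t zone_bad =
	  (1u << (F_DMA | F_DMA32))
	| (1u << (F_DMA32 | F_HIGHMEM))
	| (1u << (F_DMA | F_DMA32 | F_HIGHMEM))
	| (1u << (F_MOVABLE | F_HIGHMEM | F_DMA))
	| (1u << (F_MOVABLE | F_DMA32 | F_DMA))
	| (1u << (F_MOVABLE | F_DMA32 | F_HIGHMEM))
	| (1u << (F_MOVABLE | F_DMA32 | F_DMA | F_HIGHMEM));

static enum zone zone_for_flags(uint32_t flags)
{
	uint32_t slot = flags & (F_DMA | F_HIGHMEM | F_DMA32 | F_MOVABLE);

	/* The NVM bit does not fit the table index, so remap it. */
	if (flags & F_NVM)
		slot = NVM_SLOT;

	assert(!((zone_bad >> slot) & 1));
	return (enum zone)((zone_table >> (slot * ZONES_SHIFT)) &
			   ((1u << ZONES_SHIFT) - 1));
}
```

One consequence visible in this model: a request that sets the DMA and
HIGHMEM bits together, without the NVM bit, also indexes slot 0x3 and
resolves to the NVM zone instead of tripping the bad-combination check.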

* [External] [RFC PATCH v1 5/6] mm: get zone spanned pages separately for DRAM and NVDIMM
       [not found] ` <1525746628-114136-6-git-send-email-yehs1@lenovo.com>
@ 2018-05-08  2:34     ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:34 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

DRAM and NVDIMM are placed in separate zones, so the NVM zone is
dedicated to NVDIMMs.

In zone_spanned_pages_in_node(), the spanned pages of each zone are
calculated separately for DRAM and NVDIMM, using the memblock flags
MEMBLOCK_NONE and MEMBLOCK_NVDIMM respectively.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 mm/nobootmem.c  |  5 +++--
 mm/page_alloc.c | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 9b02fda..19b5291 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -143,8 +143,9 @@ static unsigned long __init free_low_memory_core_early(void)
 	 *  because in some case like Node0 doesn't have RAM installed
 	 *  low ram will be on Node1
 	 */
-	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
-				NULL)
+	for_each_free_mem_range(i, NUMA_NO_NODE,
+				MEMBLOCK_NONE | MEMBLOCK_NVDIMM,
+				&start, &end, NULL)
 		count += __free_memory_core(start, end);
 
 	return count;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8bd20d..3fd0d95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4221,6 +4221,11 @@ static inline void finalise_ac(gfp_t gfp_mask,
 	 * also used as the starting point for the zonelist iterator. It
 	 * may get reset for allocations that ignore memory policies.
 	 */
+#ifdef CONFIG_ZONE_NVM
+	/* Bypass ZONE_NVM for normal allocations */
+	if (ac->high_zoneidx > ZONE_NVM)
+		ac->high_zoneidx = ZONE_NORMAL;
+#endif
 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
 }
@@ -5808,6 +5813,10 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
+#ifdef CONFIG_ZONE_NVM
+	unsigned long start_pfn, end_pfn;
+#endif
+
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;
@@ -5815,6 +5824,26 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 	/* Get the start and end of the zone */
 	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
 	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+
+#ifdef CONFIG_ZONE_NVM
+	/*
+	 * Use zone_type to adjust zone size again.
+	 */
+	if (zone_type == ZONE_NVM) {
+		get_pfn_range_for_nid_with_flags(nid, &start_pfn, &end_pfn,
+							MEMBLOCK_NVDIMM);
+	} else {
+		get_pfn_range_for_nid_with_flags(nid, &start_pfn, &end_pfn,
+							MEMBLOCK_NONE);
+	}
+
+	if (*zone_end_pfn < start_pfn || *zone_start_pfn > end_pfn)
+		return 0;
+	/* Move the zone boundaries inside the possible pfn range if necessary */
+	*zone_end_pfn = min(*zone_end_pfn, end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, start_pfn);
+#endif
+
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
 				zone_start_pfn, zone_end_pfn);
@@ -6680,6 +6709,17 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 		start_pfn = end_pfn;
 	}
 
+#ifdef CONFIG_ZONE_NVM
+	/*
+	 * Set the NVM zone range, which lies inside the normal zone range
+	 */
+	get_pfn_range_for_nid_with_flags(MAX_NUMNODES, &start_pfn, &end_pfn,
+							    MEMBLOCK_NVDIMM);
+
+	arch_zone_lowest_possible_pfn[ZONE_NVM] = start_pfn;
+	arch_zone_highest_possible_pfn[ZONE_NVM] = end_pfn;
+#endif
+
 	/* Find the PFNs that ZONE_MOVABLE begins at in each node */
 	memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
 	find_zone_movable_pfns_for_nodes();
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread
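
The clamping step added to zone_spanned_pages_in_node() amounts to
intersecting the architectural zone range with the pfn range of the
memory type that belongs to that zone. A minimal userspace sketch
follows; struct pfn_range and clamp_zone_to_memory() are hypothetical
names, ranges are treated as [start, end) and the numbers used with it
are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

struct pfn_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/*
 * Shrink *zone so that it lies inside mem. Returns false when the two
 * ranges are disjoint, which corresponds to the patch's early "return 0".
 */
static bool clamp_zone_to_memory(struct pfn_range *zone, struct pfn_range mem)
{
	assert(zone->start <= zone->end && mem.start <= mem.end);

	/* Same comparisons as the patch: ranges that only touch pass. */
	if (zone->end < mem.start || zone->start > mem.end)
		return false;

	if (zone->end > mem.end)
		zone->end = mem.end;
	if (zone->start < mem.start)
		zone->start = mem.start;
	return true;
}
```

For the NVM zone, mem would be the range found with MEMBLOCK_NVDIMM; for
every other zone it would be the range found with MEMBLOCK_NONE. Because
the comparisons are strict, two ranges that merely touch are not
rejected; they produce a zone with start equal to end, i.e. zero spanned
pages.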

* [External] [RFC PATCH v1 6/6] arch/x86/mm: create page table mapping for DRAM and NVDIMM both
       [not found] ` <1525746628-114136-7-git-send-email-yehs1@lenovo.com>
@ 2018-05-08  2:35     ` Huaisheng HS1 Ye
  0 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-08  2:35 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

Create page table mappings at the PTE, PMD, PUD and P4D levels for the
physical addresses of both DRAM and NVDIMM. E820_TYPE_PMEM is the
e820_table region type that identifies NVDIMM ranges.

Signed-off-by: Huaisheng Ye <yehs1@lenovo.com>
Signed-off-by: Ocean He <hehy1@lenovo.com>
---
 arch/x86/mm/init_64.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index af11a28..c03c2091 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -420,6 +420,10 @@ void __init cleanup_highmap(void)
 			if (!after_bootmem &&
 			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
+					     E820_TYPE_PMEM) &&
+#endif
 			    !e820__mapped_any(paddr & PAGE_MASK, paddr_next,
 					     E820_TYPE_RESERVED_KERN))
 				set_pte(pte, __pte(0));
@@ -475,6 +479,10 @@ void __init cleanup_highmap(void)
 			if (!after_bootmem &&
 			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
+					     E820_TYPE_PMEM) &&
+#endif
 			    !e820__mapped_any(paddr & PMD_MASK, paddr_next,
 					     E820_TYPE_RESERVED_KERN))
 				set_pmd(pmd, __pmd(0));
@@ -561,6 +569,10 @@ void __init cleanup_highmap(void)
 			if (!after_bootmem &&
 			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
+					     E820_TYPE_PMEM) &&
+#endif
 			    !e820__mapped_any(paddr & PUD_MASK, paddr_next,
 					     E820_TYPE_RESERVED_KERN))
 				set_pud(pud, __pud(0));
@@ -647,6 +659,10 @@ void __init cleanup_highmap(void)
 			if (!after_bootmem &&
 			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
 					     E820_TYPE_RAM) &&
+#ifdef CONFIG_ZONE_NVM
+			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
+					     E820_TYPE_PMEM) &&
+#endif
 			    !e820__mapped_any(paddr & P4D_MASK, paddr_next,
 					     E820_TYPE_RESERVED_KERN))
 				set_p4d(p4d, __p4d(0));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread
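
The condition extended in each hunk decides whether a page table entry
is cleared: it is cleared only when the address range covered by the
entry overlaps none of the memory types that must stay mapped. The
sketch below models that decision in userspace C. It is an assumption-
laden illustration, not kernel code: struct mem_entry, the TYPE_*
values, mapped_any() and should_clear_entry() are hypothetical, and the
nvm_enabled parameter stands in for the CONFIG_ZONE_NVM #ifdef.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory types; values are illustrative only. */
enum mem_type { TYPE_RAM, TYPE_PMEM, TYPE_RESERVED_KERN };

struct mem_entry {
	uint64_t addr;
	uint64_t size;
	enum mem_type type;
};

/* True if any entry of the given type overlaps [start, end). */
static bool mapped_any(const struct mem_entry *table, size_t count,
		       uint64_t start, uint64_t end, enum mem_type type)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (table[i].type != type)
			continue;
		if (table[i].addr >= end ||
		    table[i].addr + table[i].size <= start)
			continue;
		return true;
	}
	return false;
}

/*
 * Decide whether the entry covering paddr at a level of size level_size
 * (page, PMD, PUD or P4D sized) may be cleared.
 */
static bool should_clear_entry(const struct mem_entry *table, size_t count,
			       uint64_t paddr, uint64_t level_size,
			       bool nvm_enabled)
{
	uint64_t start, end;

	assert(level_size != 0 && (level_size & (level_size - 1)) == 0);

	/* Align down with the mask of the level being walked. */
	start = paddr & ~(level_size - 1);
	end = start + level_size;

	if (mapped_any(table, count, start, end, TYPE_RAM))
		return false;
	if (nvm_enabled && mapped_any(table, count, start, end, TYPE_PMEM))
		return false;
	if (mapped_any(table, count, start, end, TYPE_RESERVED_KERN))
		return false;
	return true;
}
```

With nvm_enabled false, an entry that covers only persistent memory is
cleared, which models the behaviour before this patch; with it true,
the entry is kept, so the NVDIMM range stays mapped.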

* Re: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-08  2:33     ` Huaisheng HS1 Ye
@ 2018-05-08  4:43       ` Randy Dunlap
  -1 siblings, 0 replies; 31+ messages in thread
From: Randy Dunlap @ 2018-05-08  4:43 UTC (permalink / raw)
  To: Huaisheng HS1 Ye, akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka

On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c782e8f..5fe1f63 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -687,6 +687,22 @@ config ZONE_DEVICE
>  
> +config ZONE_NVM
> +	bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> +	depends on NUMA && X86_64

Hi,
I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
(i.e., UMA) configs?

Thanks.

> +	depends on HAVE_MEMBLOCK_NODE_MAP
> +	depends on HAVE_MEMBLOCK
> +	depends on !IA32_EMULATION
> +	default n
> +
> +	help
> +	  This option allows you to use memory management subsystem to manage
> +	  NVDIMM (pmem). With it mm can arrange NVDIMMs into real physical zones
> +	  like NORMAL and DMA32. That means buddy system and swap can be used
> +	  directly to NVDIMM zone. This feature is beneficial to recover
> +	  dirty pages from power fail or system crash by storing write cache
> +	  to NVDIMM zone.



-- 
~Randy

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-08  4:43       ` Randy Dunlap
@ 2018-05-09  4:22         ` Huaisheng HS1 Ye
  -1 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-09  4:22 UTC (permalink / raw)
  To: Randy Dunlap, akpm, linux-mm
  Cc: mhocko, linux-kernel, Ocean HY1 He, penguin-kernel,
	NingTing Cheng, linux-nvdimm, pasha.tatashin, willy,
	alexander.levin, hannes, colyli, mgorman, vbabka


> On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index c782e8f..5fe1f63 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -687,6 +687,22 @@ config ZONE_DEVICE
> >
> > +config ZONE_NVM
> > +	bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> > +	depends on NUMA && X86_64
> 
> Hi,
> I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
> (i.e., UMA) configs?
> 
I wrote these patches on a two-socket test platform with two DDRs and two NVDIMMs installed.
So every socket has one DDR and one NVDIMM. Here are the memory regions from memblock, which show how they are distributed.

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00000046bfffffff]
[    0.000000]   NVM      [mem 0x0000000440000000-0x00000046bfffffff]
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x00000000a69c2fff]
[    0.000000]   node   0: [mem 0x00000000a7654000-0x00000000a85eefff]
[    0.000000]   node   0: [mem 0x00000000ab399000-0x00000000af3f6fff]
[    0.000000]   node   0: [mem 0x00000000af429000-0x00000000af7fffff]
[    0.000000]   node   0: [mem 0x0000000100000000-0x000000043fffffff]	Normal 0
[    0.000000]   node   0: [mem 0x0000000440000000-0x000000237fffffff]	NVDIMM 0
[    0.000000]   node   1: [mem 0x0000002380000000-0x000000277fffffff]	Normal 1
[    0.000000]   node   1: [mem 0x0000002780000000-0x00000046bfffffff]	NVDIMM 1

If we disable NUMA, the Normal and NVDIMM zones end up overlapping each other.
The current mm treats all memory regions equally and divides zones purely by address: up to 16M for DMA, up to 4G for DMA32, and everything above that for Normal.
The spanned ranges of the zones cannot overlap.

If we enable NUMA, the DDR and NVDIMM of each socket are separated, and the NVDIMM region always comes after the Normal zone.
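
The ranges in the log can be compared with a small userspace C sketch. This is not kernel code: the struct, the helper and the variable names are hypothetical and exist only for illustration, and the addresses are copied from the log in this message. It checks only the printed ranges and says nothing about how the patch computes them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: an inclusive physical address range. */
struct mem_range {
	uint64_t start;
	uint64_t end;	/* inclusive, as printed in the log */
};

/* Two inclusive ranges overlap when each one starts no later than the other ends. */
static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* System-wide "Zone ranges" lines from the log. */
static const struct mem_range zone_normal  = { 0x0000000100000000ULL, 0x00000046bfffffffULL };
static const struct mem_range zone_nvm     = { 0x0000000440000000ULL, 0x00000046bfffffffULL };

/* Per-node ranges annotated "Normal N" and "NVDIMM N" in the log. */
static const struct mem_range node0_normal = { 0x0000000100000000ULL, 0x000000043fffffffULL };
static const struct mem_range node0_nvdimm = { 0x0000000440000000ULL, 0x000000237fffffffULL };
static const struct mem_range node1_normal = { 0x0000002380000000ULL, 0x000000277fffffffULL };
static const struct mem_range node1_nvdimm = { 0x0000002780000000ULL, 0x00000046bfffffffULL };
```

With these values, ranges_overlap(zone_normal, zone_nvm) is true, because the system-wide Normal line covers the NVM line, while the annotated per-node Normal and NVDIMM ranges do not share any address.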

Sincerely,
Huaisheng Ye 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-09  4:22         ` Huaisheng HS1 Ye
@ 2018-05-09 11:47           ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-05-09 11:47 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: linux-kernel, Ocean HY1 He, penguin-kernel, NingTing Cheng,
	Randy Dunlap, pasha.tatashin, willy, alexander.levin, linux-mm,
	hannes, akpm, colyli, mgorman, vbabka, linux-nvdimm

On Wed 09-05-18 04:22:10, Huaisheng HS1 Ye wrote:
> 
> > On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index c782e8f..5fe1f63 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -687,6 +687,22 @@ config ZONE_DEVICE
> > >
> > > +config ZONE_NVM
> > > +	bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> > > +	depends on NUMA && X86_64
> > 
> > Hi,
> > I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
> > (i.e., UMA) configs?
> > 
> I wrote these patches with two sockets testing platform, and there are two DDRs and two NVDIMMs have been installed to it.
> So, for every socket it has one DDR and one NVDIMM with it. Here is memory region from memblock, you can get its distribution.
> 
>  435 [    0.000000] Zone ranges:
>  436 [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
>  437 [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
>  438 [    0.000000]   Normal   [mem 0x0000000100000000-0x00000046bfffffff]
>  439 [    0.000000]   NVM      [mem 0x0000000440000000-0x00000046bfffffff]
>  440 [    0.000000]   Device   empty
>  441 [    0.000000] Movable zone start for each node
>  442 [    0.000000] Early memory node ranges
>  443 [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
>  444 [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000a69c2fff]
>  445 [    0.000000]   node   0: [mem 0x00000000a7654000-0x00000000a85eefff]
>  446 [    0.000000]   node   0: [mem 0x00000000ab399000-0x00000000af3f6fff]
>  447 [    0.000000]   node   0: [mem 0x00000000af429000-0x00000000af7fffff]
>  448 [    0.000000]   node   0: [mem 0x0000000100000000-0x000000043fffffff]	Normal 0
>  449 [    0.000000]   node   0: [mem 0x0000000440000000-0x000000237fffffff]	NVDIMM 0
>  450 [    0.000000]   node   1: [mem 0x0000002380000000-0x000000277fffffff]	Normal 1
>  451 [    0.000000]   node   1: [mem 0x0000002780000000-0x00000046bfffffff]	NVDIMM 1
> 
> If we disable NUMA, there is a result as Normal an NVDIMM zones will be overlapping with each other.
> Current mm treats all memory regions equally, it divides zones just by size, like 16M for DMA, 4G for DMA32, and others above for Normal.
> The spanned range of all zones couldn't be overlapped.

No, this is not correct. Zones can overlap.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-09 11:47           ` Michal Hocko
@ 2018-05-09 14:04             ` Huaisheng HS1 Ye
  -1 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-09 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Ocean HY1 He, penguin-kernel, NingTing Cheng,
	Randy Dunlap, pasha.tatashin, willy, alexander.levin, linux-mm,
	hannes, akpm, colyli, mgorman, vbabka, linux-nvdimm

> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Michal Hocko
> 
> On Wed 09-05-18 04:22:10, Huaisheng HS1 Ye wrote:
> >
> > > On 05/07/2018 07:33 PM, Huaisheng HS1 Ye wrote:
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index c782e8f..5fe1f63 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -687,6 +687,22 @@ config ZONE_DEVICE
> > > >
> > > > +config ZONE_NVM
> > > > +	bool "Manage NVDIMM (pmem) by memory management (EXPERIMENTAL)"
> > > > +	depends on NUMA && X86_64
> > >
> > > Hi,
> > > I'm curious why this depends on NUMA. Couldn't it be useful in non-NUMA
> > > (i.e., UMA) configs?
> > >
> > I wrote these patches with two sockets testing platform, and there are two DDRs and
> two NVDIMMs have been installed to it.
> > So, for every socket it has one DDR and one NVDIMM with it. Here is memory region
> from memblock, you can get its distribution.
> >
> >  435 [    0.000000] Zone ranges:
> >  436 [    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> >  437 [    0.000000]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
> >  438 [    0.000000]   Normal   [mem 0x0000000100000000-0x00000046bfffffff]
> >  439 [    0.000000]   NVM      [mem 0x0000000440000000-0x00000046bfffffff]
> >  440 [    0.000000]   Device   empty
> >  441 [    0.000000] Movable zone start for each node
> >  442 [    0.000000] Early memory node ranges
> >  443 [    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
> >  444 [    0.000000]   node   0: [mem 0x0000000000100000-0x00000000a69c2fff]
> >  445 [    0.000000]   node   0: [mem 0x00000000a7654000-0x00000000a85eefff]
> >  446 [    0.000000]   node   0: [mem 0x00000000ab399000-0x00000000af3f6fff]
> >  447 [    0.000000]   node   0: [mem 0x00000000af429000-0x00000000af7fffff]
> >  448 [    0.000000]   node   0: [mem 0x0000000100000000-0x000000043fffffff]	Normal 0
> >  449 [    0.000000]   node   0: [mem 0x0000000440000000-0x000000237fffffff]	NVDIMM 0
> >  450 [    0.000000]   node   1: [mem 0x0000002380000000-0x000000277fffffff]	Normal 1
> >  451 [    0.000000]   node   1: [mem 0x0000002780000000-0x00000046bfffffff]	NVDIMM 1
> >
> > If we disable NUMA, there is a result as Normal an NVDIMM zones will be overlapping
> with each other.
> > Current mm treats all memory regions equally, it divides zones just by size, like
> 16M for DMA, 4G for DMA32, and others above for Normal.
> > The spanned range of all zones couldn't be overlapped.
> 
> No, this is not correct. Zones can overlap.

Hi Michal,

Thanks for pointing it out.
But the function zone_sizes_init determines the arch_zone_lowest/highest_possible_pfn bounds from max_low_pfn, and free_area_init_nodes/node are then responsible for calculating the spanned size of the zones from the memblock memory regions.
So ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL cover separate address ranges. How can they overlap each other?
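
The address-only split described in this thread (16M for DMA, 4G for DMA32, the rest for Normal) can be sketched in userspace C as below. This is a simplified illustration, not the kernel's zone_sizes_init: the enum, the limits and the helper name are hypothetical and chosen only to match the numbers quoted in the thread.

```c
#include <stdint.h>

/* Hypothetical zone labels for this sketch; not the kernel's enum zone_type. */
enum sketch_zone { SKETCH_DMA, SKETCH_DMA32, SKETCH_NORMAL };

#define SKETCH_DMA_LIMIT   (16ULL << 20)	/* 16M */
#define SKETCH_DMA32_LIMIT (4ULL << 30)		/* 4G */

/* Pick a zone for a physical address purely by comparing it with fixed limits. */
static enum sketch_zone zone_for_addr(uint64_t phys_addr)
{
	if (phys_addr < SKETCH_DMA_LIMIT)
		return SKETCH_DMA;
	if (phys_addr < SKETCH_DMA32_LIMIT)
		return SKETCH_DMA32;
	return SKETCH_NORMAL;
}
```

Because each address falls into exactly one bucket, the three ranges cannot overlap. The same rule also places an NVDIMM address such as 0x440000000 into the Normal bucket, since a split by address alone has no notion of memory type.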

Sincerely,
Huaisheng Ye | 叶怀胜
Linux kernel | Lenovo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-09 14:04             ` Huaisheng HS1 Ye
@ 2018-05-09 20:56               ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-05-09 20:56 UTC (permalink / raw)
  To: Huaisheng HS1 Ye
  Cc: linux-kernel, Ocean HY1 He, penguin-kernel, NingTing Cheng,
	Randy Dunlap, pasha.tatashin, willy, alexander.levin, linux-mm,
	hannes, akpm, colyli, mgorman, vbabka, linux-nvdimm

On Wed 09-05-18 14:04:21, Huaisheng HS1 Ye wrote:
> > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of Michal Hocko
> > 
> > On Wed 09-05-18 04:22:10, Huaisheng HS1 Ye wrote:
[...]
> > > Current mm treats all memory regions equally, it divides zones just by size, like
> > 16M for DMA, 4G for DMA32, and others above for Normal.
> > > The spanned range of all zones couldn't be overlapped.
> > 
> > No, this is not correct. Zones can overlap.
> 
> Hi Michal,
> 
> Thanks for pointing it out.
> But function zone_sizes_init decides
> arch_zone_lowest/highest_possible_pfn's size by max_low_pfn, then
> free_area_init_nodes/node are responsible for calculating the spanned
> size of zones from memblock memory regions.  So, ZONE_DMA and
> ZONE_DMA32 and ZONE_NORMAL have separate address scope. How can they
> be overlapped with each other?

Sorry, I could have been a bit more specific. DMA, DMA32 and Normal
zones are exclusive. They are mapped to a specific physical range of
memory so they cannot overlap. I was referring to a general property
that zones might interleave. Especially zone Normal, Movable and Device.
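
The interleaving property can be pictured with a small userspace C sketch. Everything here is an assumption made for illustration: the block layout is invented, and the types and helpers are hypothetical rather than kernel structures. Each memory block belongs to exactly one zone, yet the span of a zone (its lowest to highest block) can still cover blocks of another zone.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-block zone tag; not the kernel's enum zone_type. */
enum blk_zone { BLK_NORMAL, BLK_MOVABLE };

/* Inclusive span of block indexes used by one zone. */
struct span {
	size_t first;
	size_t last;
	bool valid;	/* false when the zone owns no block */
};

/* Find the lowest and highest block index that belongs to zone z. */
static struct span zone_span(const enum blk_zone *blocks, size_t n, enum blk_zone z)
{
	struct span s = { 0, 0, false };

	for (size_t i = 0; i < n; i++) {
		if (blocks[i] != z)
			continue;
		if (!s.valid) {
			s.first = i;
			s.valid = true;
		}
		s.last = i;
	}
	return s;
}

/* Two spans overlap when each one starts no later than the other ends. */
static bool spans_overlap(struct span a, struct span b)
{
	return a.valid && b.valid && a.first <= b.last && b.first <= a.last;
}

/* Invented layout: Normal and Movable blocks alternate. */
static const enum blk_zone layout[] = {
	BLK_NORMAL, BLK_NORMAL, BLK_MOVABLE, BLK_NORMAL, BLK_MOVABLE,
};
```

For this layout the Normal span is blocks 0 to 3 and the Movable span is blocks 2 to 4, so spans_overlap() reports an overlap even though no block is owned by both zones.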

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE
  2018-05-09 20:56               ` Michal Hocko
@ 2018-05-10  3:53                 ` Huaisheng HS1 Ye
  -1 siblings, 0 replies; 31+ messages in thread
From: Huaisheng HS1 Ye @ 2018-05-10  3:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Ocean HY1 He, penguin-kernel, NingTing Cheng,
	Randy Dunlap, pasha.tatashin, willy, alexander.levin, linux-mm,
	hannes, akpm, colyli, mgorman, vbabka, linux-nvdimm

> 
> On Wed 09-05-18 14:04:21, Huaisheng HS1 Ye wrote:
> > > From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org] On Behalf Of
> Michal Hocko
> > >
> > > On Wed 09-05-18 04:22:10, Huaisheng HS1 Ye wrote:
> [...]
> > > > Current mm treats all memory regions equally, it divides zones just by size,
> like
> > > 16M for DMA, 4G for DMA32, and others above for Normal.
> > > > The spanned range of all zones couldn't be overlapped.
> > >
> > > No, this is not correct. Zones can overlap.
> >
> > Hi Michal,
> >
> > Thanks for pointing it out.
> > But function zone_sizes_init decides
> > arch_zone_lowest/highest_possible_pfn's size by max_low_pfn, then
> > free_area_init_nodes/node are responsible for calculating the spanned
> > size of zones from memblock memory regions.  So, ZONE_DMA and
> > ZONE_DMA32 and ZONE_NORMAL have separate address scope. How can they
> > be overlapped with each other?
> 
> Sorry, I could have been a bit more specific. DMA, DMA32 and Normal
> zones are exclusive. They are mapped to a specific physical range of
> memory so they cannot overlap. I was referring to a general property
> that zones might interleave. Especially zone Normal, Movable and Device.

Exactly. Here ZONE_NVM is a real physical range, the same as ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL, so it cannot overlap with other zones.
As you mentioned, ZONE_MOVABLE is a virtual zone, which comes from ZONE_NORMAL.
A virtual zone would be an alternative implementation to the current ZONE_NVM patch.
It has advantages but also disadvantages, which need to be clarified and discussed.

Sincerely,
Huaisheng Ye

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-08  2:30 ` Huaisheng Ye
  (?)
@ 2018-05-10  7:57   ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-05-10  7:57 UTC (permalink / raw)
  To: Huaisheng Ye
  Cc: linux-kernel, hehy1, penguin-kernel, chengnt, linux-nvdimm,
	pasha.tatashin, willy, alexander.levin, linux-mm, hannes, akpm,
	colyli, mgorman, vbabka

On Tue 08-05-18 10:30:22, Huaisheng Ye wrote:
> Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
> DEVICE zone, which is a virtual zone and both its start and end of pfn
> are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
> corresponding drivers, which locate at \drivers\nvdimm\ and
> \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> memory hot plug implementation.
> 
> With current kernel, many mm’s classical features like the buddy
> system, swap mechanism and page cache couldn’t be supported to NVDIMM.
> What we are doing is to expand kernel mm’s capacity to make it to handle
> NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
> separately, that means mm can only put the critical pages to NVDIMM
> zone, here we created a new zone type as NVM zone.

How do you define critical pages? Who is allowed to allocate from them?
You do not seem to add _any_ user of GFP_NVM.

> That is to say for
> traditional(or normal) pages which would be stored at DRAM scope like
> Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> them could be recovered from power fail or system crash, we make them
> to be persistent by storing them to NVM zone.

This brings more questions than it answers. First of all is this going
to be any guarantee? Let's say I want GFP_NVM, can I get memory from
other zones? In other words is such a request allowed to fallback to
succeed? Are we allowed to reclaim memory from the new zone? What should
happen on the OOM? How is the user expected to restore the previous
content after reboot/crash?

I am sorry if these questions are answered in the respective patches but
it would be great to have this in the cover letter to have a good
overview of the whole design. From my quick glance over patches my
previous concerns about an additional zone still hold, though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
  2018-05-10  7:57   ` Michal Hocko
  (?)
@ 2018-05-10  8:41     ` Michal Hocko
  -1 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2018-05-10  8:41 UTC (permalink / raw)
  To: Huaisheng Ye
  Cc: linux-kernel, hehy1, penguin-kernel, chengnt, linux-nvdimm,
	pasha.tatashin, willy, alexander.levin, linux-mm, hannes, akpm,
	colyli, mgorman, vbabka

I have only now noticed that you posted this a few days ago:
http://lkml.kernel.org/r/1525704627-30114-1-git-send-email-yehs1@lenovo.com
Some good questions were asked there, and many of mine overlap with
them, yet they are not covered in the cover letter. Please _always_
make sure to answer review comments before reposting. Otherwise some
important parts get lost on the way.

On Thu 10-05-18 09:57:59, Michal Hocko wrote:
> On Tue 08-05-18 10:30:22, Huaisheng Ye wrote:
> > Traditionally, NVDIMMs are treated by mm(memory management) subsystem as
> > DEVICE zone, which is a virtual zone and both its start and end of pfn
> > are equal to 0, mm wouldn’t manage NVDIMM directly as DRAM, kernel uses
> > corresponding drivers, which locate at \drivers\nvdimm\ and
> > \drivers\acpi\nfit and fs, to realize NVDIMM memory alloc and free with
> > memory hot plug implementation.
> > 
> > With current kernel, many mm’s classical features like the buddy
> > system, swap mechanism and page cache couldn’t be supported to NVDIMM.
> > What we are doing is to expand kernel mm’s capacity to make it to handle
> > NVDIMM like DRAM. Furthermore we make mm could treat DRAM and NVDIMM
> > separately, that means mm can only put the critical pages to NVDIMM
> > zone, here we created a new zone type as NVM zone.
> 
> How do you define critical pages? Who is allowed to allocate from them?
> You do not seem to add _any_ user of GFP_NVM.
> 
> > That is to say for
> > traditional(or normal) pages which would be stored at DRAM scope like
> > Normal, DMA32 and DMA zones. But for the critical pages, which we hope
> > them could be recovered from power fail or system crash, we make them
> > to be persistent by storing them to NVM zone.
> 
> This brings more questions than it answers. First of all is this going
> to be any guarantee? Let's say I want GFP_NVM, can I get memory from
> other zones? In other words is such a request allowed to fallback to
> succeed? Are we allowed to reclaim memory from the new zone? What should
> happen on the OOM? How is the user expected to restore the previous
> content after reboot/crash?
> 
> I am sorry if these questions are answered in the respective patches but
> it would be great to have this in the cover letter to have a good
> overview of the whole design. From my quick glance over patches my
> previous concerns about an additional zone still hold, though.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-05-10  8:41 UTC | newest]

Thread overview: 31+ messages
2018-05-08  2:30 [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone Huaisheng Ye
2018-05-08  2:30 ` Huaisheng Ye
     [not found] ` <1525746628-114136-2-git-send-email-yehs1@lenovo.com>
2018-05-08  2:30   ` [External] [RFC PATCH v1 1/6] mm/memblock: Expand definition of flags to support NVDIMM Huaisheng HS1 Ye
2018-05-08  2:30     ` Huaisheng HS1 Ye
2018-05-08  2:30 ` [RFC PATCH v1 4/6] arch/x86/kernel: mark NVDIMM regions from e820_table Huaisheng Ye
     [not found] ` <1525746628-114136-3-git-send-email-yehs1@lenovo.com>
2018-05-08  2:32   ` [External] [RFC PATCH v1 2/6] mm/page_alloc.c: get pfn range with flags of memblock Huaisheng HS1 Ye
2018-05-08  2:32     ` Huaisheng HS1 Ye
     [not found] ` <1525746628-114136-4-git-send-email-yehs1@lenovo.com>
2018-05-08  2:33   ` [External] [RFC PATCH v1 3/6] mm, zone_type: create ZONE_NVM and fill into GFP_ZONE_TABLE Huaisheng HS1 Ye
2018-05-08  2:33     ` Huaisheng HS1 Ye
2018-05-08  4:43     ` Randy Dunlap
2018-05-08  4:43       ` Randy Dunlap
2018-05-09  4:22       ` Huaisheng HS1 Ye
2018-05-09  4:22         ` Huaisheng HS1 Ye
2018-05-09 11:47         ` Michal Hocko
2018-05-09 11:47           ` Michal Hocko
2018-05-09 14:04           ` Huaisheng HS1 Ye
2018-05-09 14:04             ` Huaisheng HS1 Ye
2018-05-09 20:56             ` Michal Hocko
2018-05-09 20:56               ` Michal Hocko
2018-05-10  3:53               ` Huaisheng HS1 Ye
2018-05-10  3:53                 ` Huaisheng HS1 Ye
     [not found] ` <1525746628-114136-6-git-send-email-yehs1@lenovo.com>
2018-05-08  2:34   ` [External] [RFC PATCH v1 5/6] mm: get zone spanned pages separately for DRAM and NVDIMM Huaisheng HS1 Ye
2018-05-08  2:34     ` Huaisheng HS1 Ye
     [not found] ` <1525746628-114136-7-git-send-email-yehs1@lenovo.com>
2018-05-08  2:35   ` [External] [RFC PATCH v1 6/6] arch/x86/mm: create page table mapping for DRAM and NVDIMM both Huaisheng HS1 Ye
2018-05-08  2:35     ` Huaisheng HS1 Ye
2018-05-10  7:57 ` [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone Michal Hocko
2018-05-10  7:57   ` Michal Hocko
2018-05-10  7:57   ` Michal Hocko
2018-05-10  8:41   ` Michal Hocko
2018-05-10  8:41     ` Michal Hocko
2018-05-10  8:41     ` Michal Hocko
