LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
@ 2019-09-10  6:28 Aneesh Kumar K.V
  2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-10  6:28 UTC (permalink / raw)
  To: dan.j.williams, mpe, oohall
  Cc: Sachin Sant, Aneesh Kumar K.V, linuxppc-dev, linux-nvdimm

With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with large page size. On
architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
This maps a namespace size of 16G.

When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.

Add resource end pnf in altmap and use that to check if the memmap area
allocation can map pfn outside the namespace. On ppc64 in such case we fallback
to allocation from memory.

This fix kernel crash reported below:

[  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
[  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[  133.464760] Faulting instruction address: 0xc00000000007580c
[  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.....
[  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
[  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
[  133.464910] Call Trace:
[  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
[  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
[  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
[  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
[  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
[  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
[  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
[  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
[  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
[  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
[  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
[  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
[  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
[  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
 drivers/nvdimm/pfn_devs.c |  2 ++
 include/linux/memremap.h  |  1 +
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a44f6281ca3a..4e08246acd79 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
 	vmemmap_list = vmem_back;
 }
 
+static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+				unsigned long page_size)
+{
+	unsigned long nr_pfn = page_size / sizeof(struct page);
+	unsigned long start_pfn = page_to_pfn((struct page *)start);
+
+	if ((start_pfn + nr_pfn) > altmap->end_pfn)
+		return true;
+
+	if (start_pfn < altmap->base_pfn)
+		return true;
+
+	return false;
+}
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
@@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		 * fail due to alignment issues when using 16MB hugepages, so
 		 * fall back to system memory if the altmap allocation fail.
 		 */
-		if (altmap) {
+		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
 			p = altmap_alloc_block_buf(page_size, altmap);
 			if (!p)
 				pr_debug("altmap block allocation failed, falling back to system memory");
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e7b11cf1aae..a616d69c8224 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	struct nd_namespace_common *ndns = nd_pfn->ndns;
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	resource_size_t base = nsio->res.start + start_pad;
+	resource_size_t end = nsio->res.end - end_trunc;
 	struct vmem_altmap __altmap = {
 		.base_pfn = init_altmap_base(base),
 		.reserve = init_altmap_reserve(base),
+		.end_pfn = PHYS_PFN(end),
 	};
 
 	memcpy(res, &nsio->res, sizeof(*res));
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f8a5b2a19945..c70996fe48c8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,6 +17,7 @@ struct device;
  */
 struct vmem_altmap {
 	const unsigned long base_pfn;
+	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
 	unsigned long align;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
@ 2019-09-10  6:28 ` Aneesh Kumar K.V
  2019-09-16 17:46   ` Dan Williams
  2019-09-17  5:47   ` Oliver O'Halloran
  2019-09-10  7:40 ` [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Pankaj Gupta
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 14+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-10  6:28 UTC (permalink / raw)
  To: dan.j.williams, mpe, oohall; +Cc: Aneesh Kumar K.V, linuxppc-dev, linux-nvdimm

With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
can map the memmap area using 16MB hugepage size and that can cover
a memory range of 16G.

While enabling new pmem namespaces, since memory is added in sub-section chunks,
before creating a new memmap mapping, kernel should check whether there is an
existing memmap mapping covering the new pmem namespace. Currently, this is
validated by checking whether the section covering the range is already
initialized or not. Considering there can be multiple namespaces in the same
section this can result in wrong validation. Update this to check for
sub-sections in the range. This is done by checking for all pfns in the range we
are mapping.

We could optimize this by checking only just one pfn in each sub-section. But
since this is not fast-path we keep this simple.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 45 ++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 4e08246acd79..7710ccdc19a2 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -70,30 +70,24 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
- * Given an address within the vmemmap, determine the pfn of the page that
- * represents the start of the section it is within.  Note that we have to
- * do this by hand as the proffered address may not be correctly aligned.
- * Subtraction of non-aligned pointers produces undefined results.
- */
-static unsigned long __meminit vmemmap_section_start(unsigned long page)
-{
-	unsigned long offset = page - ((unsigned long)(vmemmap));
-
-	/* Return the pfn of the start of the section. */
-	return (offset / sizeof(struct page)) & PAGE_SECTION_MASK;
-}
-
-/*
- * Check if this vmemmap page is already initialised.  If any section
+ * Check if this vmemmap page is already initialised.  If any sub section
  * which overlaps this vmemmap page is initialised then this page is
  * initialised already.
  */
-static int __meminit vmemmap_populated(unsigned long start, int page_size)
+
+static int __meminit vmemmap_populated(unsigned long start, int size)
 {
-	unsigned long end = start + page_size;
-	start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
+	unsigned long end = start + size;
 
-	for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
+	/* start is size aligned and it is always > sizeof(struct page) */
+	VM_BUG_ON(start & sizeof(struct page));
+	for (; start < end; start += sizeof(struct page))
+		/*
+		 * pfn valid check here is intended to really check
+		 * whether we have any subsection already initialized
+		 * in this range. We keep it simple by checking every
+		 * pfn in the range.
+		 */
 		if (pfn_valid(page_to_pfn((struct page *)start)))
 			return 1;
 
@@ -201,6 +195,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		void *p = NULL;
 		int rc;
 
+		/*
+		 * This vmemmap range is backing different subsections. If any
+		 * of that subsection is marked valid, that means we already
+		 * have initialized a page table covering this range and hence
+		 * the vmemmap range is populated.
+		 */
 		if (vmemmap_populated(start, page_size))
 			continue;
 
@@ -290,9 +290,10 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
 		struct page *page;
 
 		/*
-		 * the section has already be marked as invalid, so
-		 * vmemmap_populated() true means some other sections still
-		 * in this page, so skip it.
+		 * We have already marked the subsection we are trying to remove
+		 * invalid. So if we want to remove the vmemmap range, we
+		 * need to make sure there is no subsection marked valid
+		 * in this range.
 		 */
 		if (vmemmap_populated(start, page_size))
 			continue;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
  2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
@ 2019-09-10  7:40 ` Pankaj Gupta
  2019-09-10  8:10 ` Dan Williams
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Pankaj Gupta @ 2019-09-10  7:40 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sachin Sant, linux-nvdimm, oohall, dan j williams, linuxppc-dev


> 
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
> 
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
> 
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we
> fallback
> to allocation from memory.
> 
> This fix kernel crash reported below:
> 
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133
> devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0
> (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44]
> section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270]
> __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158]
> arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0]
> devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0]
> devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4]
> release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c]
> device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc]
> kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> 
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned
> long phys,
>  	vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long
> start,
> +				unsigned long page_size)
> +{
> +	unsigned long nr_pfn = page_size / sizeof(struct page);
> +	unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> +	if ((start_pfn + nr_pfn) > altmap->end_pfn)
> +		return true;
> +
> +	if (start_pfn < altmap->base_pfn)
> +		return true;
> +
> +	return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int
>  node,
>  		struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start,
> unsigned long end, int node,
>  		 * fail due to alignment issues when using 16MB hugepages, so
>  		 * fall back to system memory if the altmap allocation fail.
>  		 */
> -		if (altmap) {
> +		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>  			p = altmap_alloc_block_buf(page_size, altmap);
>  			if (!p)
>  				pr_debug("altmap block allocation failed, falling back to system
>  				memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
> struct dev_pagemap *pgmap)
>  	struct nd_namespace_common *ndns = nd_pfn->ndns;
>  	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>  	resource_size_t base = nsio->res.start + start_pad;
> +	resource_size_t end = nsio->res.end - end_trunc;
>  	struct vmem_altmap __altmap = {
>  		.base_pfn = init_altmap_base(base),
>  		.reserve = init_altmap_reserve(base),
> +		.end_pfn = PHYS_PFN(end),
>  	};
>  
>  	memcpy(res, &nsio->res, sizeof(*res));
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f8a5b2a19945..c70996fe48c8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,6 +17,7 @@ struct device;
>   */
>  struct vmem_altmap {
>  	const unsigned long base_pfn;
> +	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
>  	unsigned long align;
> --
> 2.21.0

This patch looks good to me. It helps to prevent
namespace access across boundaries for altmap
hugepage allocation.

Reviewed-by: Pankaj Gupta <pagupta@redhat.com>

Thanks,
Pankaj
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
  2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
  2019-09-10  7:40 ` [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Pankaj Gupta
@ 2019-09-10  8:10 ` Dan Williams
  2019-09-10  8:30   ` Aneesh Kumar K.V
  2019-09-10  8:29 ` Santosh Sivaraj
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Dan Williams @ 2019-09-10  8:10 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.

Shouldn't this instead be comprehended by nd_pfn_init() to increase
the reservation size so that it fits in the alignment? It may not
always be possible to fall back to allocation from memory for
extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
there may not be enough DRAM to store the memmap.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2019-09-10  8:10 ` Dan Williams
@ 2019-09-10  8:29 ` Santosh Sivaraj
  2019-09-11 12:03 ` Johannes Thumshirn
  2019-09-16 17:58 ` Dan Williams
  5 siblings, 0 replies; 14+ messages in thread
From: Santosh Sivaraj @ 2019-09-10  8:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V, dan.j.williams, mpe, oohall
  Cc: Sachin Sant, Aneesh Kumar K.V, linuxppc-dev, linux-nvdimm

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.
>
> This fix kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)

Tested-by: Santosh Sivaraj <santosh@fossix.org>

>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
>  	vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
> +				unsigned long page_size)
> +{
> +	unsigned long nr_pfn = page_size / sizeof(struct page);
> +	unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> +	if ((start_pfn + nr_pfn) > altmap->end_pfn)
> +		return true;
> +
> +	if (start_pfn < altmap->base_pfn)
> +		return true;
> +
> +	return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		 * fail due to alignment issues when using 16MB hugepages, so
>  		 * fall back to system memory if the altmap allocation fail.
>  		 */
> -		if (altmap) {
> +		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>  			p = altmap_alloc_block_buf(page_size, altmap);
>  			if (!p)
>  				pr_debug("altmap block allocation failed, falling back to system memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
>  	struct nd_namespace_common *ndns = nd_pfn->ndns;
>  	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>  	resource_size_t base = nsio->res.start + start_pad;
> +	resource_size_t end = nsio->res.end - end_trunc;
>  	struct vmem_altmap __altmap = {
>  		.base_pfn = init_altmap_base(base),
>  		.reserve = init_altmap_reserve(base),
> +		.end_pfn = PHYS_PFN(end),
>  	};
>  
>  	memcpy(res, &nsio->res, sizeof(*res));
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f8a5b2a19945..c70996fe48c8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,6 +17,7 @@ struct device;
>   */
>  struct vmem_altmap {
>  	const unsigned long base_pfn;
> +	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
>  	unsigned long align;
> -- 
> 2.21.0
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  8:10 ` Dan Williams
@ 2019-09-10  8:30   ` Aneesh Kumar K.V
  2019-09-10  9:08     ` Dan Williams
  0 siblings, 1 reply; 14+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-10  8:30 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On 9/10/19 1:40 PM, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
>> area. Some architectures map the memmap area with large page size. On
>> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
>> This maps a namespace size of 16G.
>>
>> When populating memmap region with 16MB page from the device area,
>> make sure the allocated space is not used to map resources outside this
>> namespace. Such usage of device area will prevent a namespace destroy.
>>
>> Add resource end pnf in altmap and use that to check if the memmap area
>> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
>> to allocation from memory.
> 
> Shouldn't this instead be comprehended by nd_pfn_init() to increase
> the reservation size so that it fits in the alignment? It may not
> always be possible to fall back to allocation from memory for
> extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
> there may not be enough DRAM to store the memmap.
> 

We do switch between DRAM and device for memmap allocation. ppc64 
vmemmap_populate  does

if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
	p = altmap_alloc_block_buf(page_size, altmap);
	if (!p)
		pr_debug("altmap block allocation failed, falling back to system memory");
	}
	if (!p)
		p = vmemmap_alloc_block_buf(page_size, node);
	

With that we should be using DRAM for the first and the last mapping, 
rest of the memmap should be backed by device.

-aneesh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  8:30   ` Aneesh Kumar K.V
@ 2019-09-10  9:08     ` Dan Williams
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Williams @ 2019-09-10  9:08 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On Tue, Sep 10, 2019 at 1:31 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 9/10/19 1:40 PM, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> >> area. Some architectures map the memmap area with large page size. On
> >> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> >> This maps a namespace size of 16G.
> >>
> >> When populating memmap region with 16MB page from the device area,
> >> make sure the allocated space is not used to map resources outside this
> >> namespace. Such usage of device area will prevent a namespace destroy.
> >>
> >> Add resource end pnf in altmap and use that to check if the memmap area
> >> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> >> to allocation from memory.
> >
> > Shouldn't this instead be comprehended by nd_pfn_init() to increase
> > the reservation size so that it fits in the alignment? It may not
> > always be possible to fall back to allocation from memory for
> > extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
> > there may not be enough DRAM to store the memmap.
> >
>
> We do switch between DRAM and device for memmap allocation. ppc64
> vmemmap_populate  does
>
> if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>         p = altmap_alloc_block_buf(page_size, altmap);
>         if (!p)
>                 pr_debug("altmap block allocation failed, falling back to system memory");
>         }
>         if (!p)
>                 p = vmemmap_alloc_block_buf(page_size, node);
>
>
> With that we should be using DRAM for the first and the last mapping,
> rest of the memmap should be backed by device.

Ah, ok, makes sense.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2019-09-10  8:29 ` Santosh Sivaraj
@ 2019-09-11 12:03 ` Johannes Thumshirn
  2019-09-16 17:58 ` Dan Williams
  5 siblings, 0 replies; 14+ messages in thread
From: Johannes Thumshirn @ 2019-09-11 12:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V, dan.j.williams, mpe, oohall
  Cc: Sachin Sant, linuxppc-dev, linux-nvdimm

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 247165, AG München)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
  2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
@ 2019-09-16 17:46   ` Dan Williams
  2019-09-17  9:54     ` Aneesh Kumar K.V
  2019-09-17  5:47   ` Oliver O'Halloran
  1 sibling, 1 reply; 14+ messages in thread
From: Dan Williams @ 2019-09-16 17:46 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
> pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
> can map the memmap area using 16MB hugepage size and that can cover
> a memory range of 16G.
>
> While enabling new pmem namespaces, since memory is added in sub-section chunks,
> before creating a new memmap mapping, kernel should check whether there is an
> existing memmap mapping covering the new pmem namespace. Currently, this is
> validated by checking whether the section covering the range is already
> initialized or not. Considering there can be multiple namespaces in the same
> section this can result in wrong validation. Update this to check for
> sub-sections in the range. This is done by checking for all pfns in the range we
> are mapping.
>
> We could optimize this by checking only just one pfn in each sub-section. But
> since this is not fast-path we keep this simple.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 45 ++++++++++++++++++++-------------------
>  1 file changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 4e08246acd79..7710ccdc19a2 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -70,30 +70,24 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
>
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  /*
> - * Given an address within the vmemmap, determine the pfn of the page that
> - * represents the start of the section it is within.  Note that we have to
> - * do this by hand as the proffered address may not be correctly aligned.
> - * Subtraction of non-aligned pointers produces undefined results.
> - */
> -static unsigned long __meminit vmemmap_section_start(unsigned long page)
> -{
> -       unsigned long offset = page - ((unsigned long)(vmemmap));
> -
> -       /* Return the pfn of the start of the section. */
> -       return (offset / sizeof(struct page)) & PAGE_SECTION_MASK;
> -}
> -
> -/*
> - * Check if this vmemmap page is already initialised.  If any section
> + * Check if this vmemmap page is already initialised.  If any sub section
>   * which overlaps this vmemmap page is initialised then this page is
>   * initialised already.
>   */
> -static int __meminit vmemmap_populated(unsigned long start, int page_size)
> +
> +static int __meminit vmemmap_populated(unsigned long start, int size)
>  {
> -       unsigned long end = start + page_size;
> -       start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
> +       unsigned long end = start + size;
>
> -       for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
> +       /* start is size aligned and it is always > sizeof(struct page) */
> +       VM_BUG_ON(start & sizeof(struct page));

If start is size aligned why not include that assumption in the VM_BUG_ON()?

Otherwise it seems this patch could be reduced simply by:

s/PAGE_SECTION_MASK/PAGE_SUBSECTION_MASK/
s/PAGES_PER_SECTION/PAGES_PER_SUBSECTION/

...and leave the vmemmap_section_start() function in place? In other
words this path used to guarantee that 'start' was aligned to the
minimum mem-hotplug granularity, the change looks ok on the surface,
but it seems a subtle change in semantics.

Can you get an ack from a powerpc maintainer, or maybe this patch
should route through the powerpc tree?

I'll take patch1 through the nvdimm tree.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2019-09-11 12:03 ` Johannes Thumshirn
@ 2019-09-16 17:58 ` Dan Williams
  2019-09-17  7:39   ` Aneesh Kumar K.V
  5 siblings, 1 reply; 14+ messages in thread
From: Dan Williams @ 2019-09-16 17:58 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.
>
> This fix kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

Question, does this crash only happen when the namespace is not 16MB
aligned? In other words was this bug exposed by the subsection-hotplug
changes and should it contain Fixes: tag for those commits?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
  2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
  2019-09-16 17:46   ` Dan Williams
@ 2019-09-17  5:47   ` Oliver O'Halloran
  1 sibling, 0 replies; 14+ messages in thread
From: Oliver O'Halloran @ 2019-09-17  5:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Dan Williams, linuxppc-dev, linux-nvdimm

On Tue, Sep 10, 2019 at 4:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
> pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
> can map the memmap area using 16MB hugepage size and that can cover
> a memory range of 16G.
>
> While enabling new pmem namespaces, since memory is added in sub-section chunks,
> before creating a new memmap mapping, kernel should check whether there is an
> existing memmap mapping covering the new pmem namespace. Currently, this is
> validated by checking whether the section covering the range is already
> initialized or not. Considering there can be multiple namespaces in the same
> section this can result in wrong validation. Update this to check for
> sub-sections in the range. This is done by checking for all pfns in the range we
> are mapping.
>
> We could optimize this by checking only just one pfn in each sub-section. But
> since this is not fast-path we keep this simple.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 45 ++++++++++++++++++++-------------------
>  1 file changed, 23 insertions(+), 22 deletions(-)
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index 4e08246acd79..7710ccdc19a2 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -70,30 +70,24 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
>
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  /*
> - * Given an address within the vmemmap, determine the pfn of the page that
> - * represents the start of the section it is within.  Note that we have to
> - * do this by hand as the proffered address may not be correctly aligned.
> - * Subtraction of non-aligned pointers produces undefined results.
> - */
> -static unsigned long __meminit vmemmap_section_start(unsigned long page)
> -{
> -       unsigned long offset = page - ((unsigned long)(vmemmap));
> -
> -       /* Return the pfn of the start of the section. */
> -       return (offset / sizeof(struct page)) & PAGE_SECTION_MASK;
> -}

If you want to go with Dan's suggestion of keeping the function and
using PAGE_SUBSECTION_MASK then can you fold the pfn_to_page() below
into vmemmap_section_start()? The current behaviour of returning a pfn
makes no sense to me.

> - * Check if this vmemmap page is already initialised.  If any section
> + * Check if this vmemmap page is already initialised.  If any sub section
>   * which overlaps this vmemmap page is initialised then this page is
>   * initialised already.
>   */
> -static int __meminit vmemmap_populated(unsigned long start, int page_size)
> +
> +static int __meminit vmemmap_populated(unsigned long start, int size)
>  {
> -       unsigned long end = start + page_size;
> -       start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
> +       unsigned long end = start + size;

> -       for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
> +       /* start is size aligned and it is always > sizeof(struct page) */
> +       VM_BUG_ON(start & sizeof(struct page));

Shouldn't the test be: start & (sizeof(struct page) - 1)?
VM_BUG_ON(start != ALIGN(start, page_size)) would be clearer.

> +       for (; start < end; start += sizeof(struct page))
> +               /*
> +                * pfn valid check here is intended to really check
> +                * whether we have any subsection already initialized
> +                * in this range. We keep it simple by checking every
> +                * pfn in the range.
> +                */
>                 if (pfn_valid(page_to_pfn((struct page *)start)))
>                         return 1;

Having a few lines of separation between the for () and the loop body
always looks a bit sketch, even if it's just a comment. Wrapping the
block in braces or moving the comment above the loop is probably a
good idea.

Looks fine otherwise

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-16 17:58 ` Dan Williams
@ 2019-09-17  7:39   ` Aneesh Kumar K.V
  2019-09-17 14:24     ` Dan Williams
  0 siblings, 1 reply; 14+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-17  7:39 UTC (permalink / raw)
  To: Dan Williams
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On 9/16/19 11:28 PM, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
>> area. Some architectures map the memmap area with large page size. On
>> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
>> This maps a namespace size of 16G.
>>
>> When populating memmap region with 16MB page from the device area,
>> make sure the allocated space is not used to map resources outside this
>> namespace. Such usage of device area will prevent a namespace destroy.
>>
>> Add resource end pnf in altmap and use that to check if the memmap area
>> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
>> to allocation from memory.
>>
>> This fix kernel crash reported below:
>>
>> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
>> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
>> [  133.464760] Faulting instruction address: 0xc00000000007580c
>> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> .....
>> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
>> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
>> [  133.464910] Call Trace:
>> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
>> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
>> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
>> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
>> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
>> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
>> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
>> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
>> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
>> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
>> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
>> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
>> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
>> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> 
> Question, does this crash only happen when the namespace is not 16MB
> aligned? In other words was this bug exposed by the subsection-hotplug
> changes and should it contain Fixes: tag for those commits?
> 

We are able to hit this crash even with older kernels. This happens when 
we have multiple namespaces from the same region of size 26G. In that 
case we need to make sure we don't end up using altmap from one 
namespace for mapping vmemmap of the adjacent namespace.

Considering this impacts ppc64 and we got the ppc64 SCM support in 4.20. 
may be  we can do just
Cc: <stable@vger.kernel.org> # 4.20+

-aneesh

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range
  2019-09-16 17:46   ` Dan Williams
@ 2019-09-17  9:54     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 14+ messages in thread
From: Aneesh Kumar K.V @ 2019-09-17  9:54 UTC (permalink / raw)
  To: Dan Williams; +Cc: Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On 9/16/19 11:16 PM, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> With commit: 7cc7867fb061 ("mm/devm_memremap_pages: enable sub-section remap")
>> pmem namespaces are remapped in 2M chunks. On architectures like ppc64 we
>> can map the memmap area using 16MB hugepage size and that can cover
>> a memory range of 16G.
>>
>> While enabling new pmem namespaces, since memory is added in sub-section chunks,
>> before creating a new memmap mapping, kernel should check whether there is an
>> existing memmap mapping covering the new pmem namespace. Currently, this is
>> validated by checking whether the section covering the range is already
>> initialized or not. Considering there can be multiple namespaces in the same
>> section this can result in wrong validation. Update this to check for
>> sub-sections in the range. This is done by checking for all pfns in the range we
>> are mapping.
>>
>> We could optimize this by checking only just one pfn in each sub-section. But
>> since this is not fast-path we keep this simple.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/mm/init_64.c | 45 ++++++++++++++++++++-------------------
>>   1 file changed, 23 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
>> index 4e08246acd79..7710ccdc19a2 100644
>> --- a/arch/powerpc/mm/init_64.c
>> +++ b/arch/powerpc/mm/init_64.c
>> @@ -70,30 +70,24 @@ EXPORT_SYMBOL_GPL(kernstart_addr);
>>
>>   #ifdef CONFIG_SPARSEMEM_VMEMMAP
>>   /*
>> - * Given an address within the vmemmap, determine the pfn of the page that
>> - * represents the start of the section it is within.  Note that we have to
>> - * do this by hand as the proffered address may not be correctly aligned.
>> - * Subtraction of non-aligned pointers produces undefined results.
>> - */
>> -static unsigned long __meminit vmemmap_section_start(unsigned long page)
>> -{
>> -       unsigned long offset = page - ((unsigned long)(vmemmap));
>> -
>> -       /* Return the pfn of the start of the section. */
>> -       return (offset / sizeof(struct page)) & PAGE_SECTION_MASK;
>> -}
>> -
>> -/*
>> - * Check if this vmemmap page is already initialised.  If any section
>> + * Check if this vmemmap page is already initialised.  If any sub section
>>    * which overlaps this vmemmap page is initialised then this page is
>>    * initialised already.
>>    */
>> -static int __meminit vmemmap_populated(unsigned long start, int page_size)
>> +
>> +static int __meminit vmemmap_populated(unsigned long start, int size)
>>   {
>> -       unsigned long end = start + page_size;
>> -       start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
>> +       unsigned long end = start + size;
>>
>> -       for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
>> +       /* start is size aligned and it is always > sizeof(struct page) */
>> +       VM_BUG_ON(start & sizeof(struct page));
> 
> If start is size aligned why not include that assumption in the VM_BUG_ON()?
> 
> Otherwise it seems this patch could be reduced simply by:
> 
> s/PAGE_SECTION_MASK/PAGE_SUBSECTION_MASK/
> s/PAGES_PER_SECTION/PAGES_PER_SUBSECTION/
> 
> ...and leave the vmemmap_section_start() function in place? In other
> words this path used to guarantee that 'start' was aligned to the
> minimum mem-hotplug granularity, the change looks ok on the surface,
> but it seems a subtle change in semantics.
> 


How about this?

/*
  * Given an address within the vmemmap, determine the page that
  * represents the start of the subsection it is within.  Note that we 
have to
  * do this by hand as the proffered address may not be correctly aligned.
  * Subtraction of non-aligned pointers produces undefined results.
  */
static struct page * __meminit vmemmap_subsection_start(unsigned long 
vmemmap_addr)
{
	unsigned long start_pfn;
	unsigned long offset = vmemmap_addr - ((unsigned long)(vmemmap));

	/* Return the pfn of the start of the section. */
	start_pfn = (offset / sizeof(struct page)) & PAGE_SUBSECTION_MASK;
	return pfn_to_page(start_pfn);
}

/*
  * Since memory is added in sub-section chunks, before creating a new 
vmemmap
  * mapping, the kernel should check whether there is an existing memmap 
mapping
  * covering the new subsection added. This is needed because kernel can map
  * vmemmap area using 16MB pages which will cover a memory range of 
16G. Such
  * a range covers multiple subsections (2M)
  *
  * If any subsection in the 16G range mapped by vmemmap is valid we 
consider the
  * vmemmap populated (There is a page table entry already present). We 
can't do
  * a page table lookup here because with the hash translation we don't keep
  * vmemmap details in linux page table.
  */
static int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size)
{
	struct page *start;
	unsigned long vmemmap_end = vmemmap_addr + vmemmap_map_size;
	start = vmemmap_subsection_start(vmemmap_addr);

	for (; (unsigned long)start < vmemmap_end; start += PAGES_PER_SUBSECTION)
		/*
		 * pfn valid check here is intended to really check
		 * whether we have any subsection already initialized
		 * in this range.
		 */
		if (pfn_valid(page_to_pfn(start)))
			return 1;

	return 0;
}




> Can you get an ack from a powerpc maintainer, or maybe this patch
> should route through the powerpc tree?
> 

I guess this can go via powerpc tree. I will send an update V2 patch for 
this alone and work with Michael to pick that up.

> I'll take patch1 through the nvdimm tree.
> 

Thanks.

-aneesh




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap
  2019-09-17  7:39   ` Aneesh Kumar K.V
@ 2019-09-17 14:24     ` Dan Williams
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Williams @ 2019-09-17 14:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Sachin Sant, Oliver O'Halloran, linuxppc-dev, linux-nvdimm

On Tue, Sep 17, 2019 at 12:40 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 9/16/19 11:28 PM, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> >> area. Some architectures map the memmap area with large page size. On
> >> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> >> This maps a namespace size of 16G.
> >>
> >> When populating memmap region with 16MB page from the device area,
> >> make sure the allocated space is not used to map resources outside this
> >> namespace. Such usage of device area will prevent a namespace destroy.
> >>
> >> Add resource end pnf in altmap and use that to check if the memmap area
> >> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> >> to allocation from memory.
> >>
> >> This fix kernel crash reported below:
> >>
> >> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> >> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> >> [  133.464760] Faulting instruction address: 0xc00000000007580c
> >> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> >> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> >> .....
> >> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> >> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> >> [  133.464910] Call Trace:
> >> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> >> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> >> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> >> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> >> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> >> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> >> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> >> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> >> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> >> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> >> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> >> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> >> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> >> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> >
> > Question, does this crash only happen when the namespace is not 16MB
> > aligned? In other words was this bug exposed by the subsection-hotplug
> > changes and should it contain Fixes: tag for those commits?
> >
>
> We are able to hit this crash even with older kernels. This happens when
> we have multiple namespaces from the same region of size 26G. In that
> case we need to make sure we don't end up using altmap from one
> namespace for mapping vmemmap of the adjacent namespace.
>
> Considering this impacts ppc64 and we got the ppc64 SCM support in 4.20.
> may be  we can do just
> Cc: <stable@vger.kernel.org> # 4.20+

Ok, thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, back to index

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-10  6:28 [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Aneesh Kumar K.V
2019-09-10  6:28 ` [PATCH 2/2] powerpc/nvdimm: Update vmemmap_populated to check sub-section range Aneesh Kumar K.V
2019-09-16 17:46   ` Dan Williams
2019-09-17  9:54     ` Aneesh Kumar K.V
2019-09-17  5:47   ` Oliver O'Halloran
2019-09-10  7:40 ` [PATCH 1/2] libnvdimm/altmap: Track namespace boundaries in altmap Pankaj Gupta
2019-09-10  8:10 ` Dan Williams
2019-09-10  8:30   ` Aneesh Kumar K.V
2019-09-10  9:08     ` Dan Williams
2019-09-10  8:29 ` Santosh Sivaraj
2019-09-11 12:03 ` Johannes Thumshirn
2019-09-16 17:58 ` Dan Williams
2019-09-17  7:39   ` Aneesh Kumar K.V
2019-09-17 14:24     ` Dan Williams

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org linuxppc-dev@archiver.kernel.org
	public-inbox-index linuxppc-dev


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox