* [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes
@ 2020-06-25  6:45 Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings Aneesh Kumar K.V
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao

This is the next version of the fixes for memory unplug on radix.
The issues and the fix are described in the actual patches.

Changes from v1:
- Added back patch to drop split_kernel_mapping
- Most of the split_kernel_mapping related issues are now described
  in the removal patch
- Dropped the pte fragment change
- Use the LMB size as the max mapping size.
- Radix baremetal now uses a memory block size of 1G.


Changes from v0:
==============
- Rebased to latest kernel.
- Took care of p4d changes.
- Addressed Aneesh's review feedback:
 - Added comments.
 - Indentation fixed.
- Dropped the 1st patch (setting DRCONF_MEM_HOTREMOVABLE lmb flags) as
  it is debatable whether this flag should be set in the device tree by
  the OS rather than by the platform in case of hotplug. This can be
  looked at separately.
  (The fixes in this patchset remain valid without the dropped patch)
- Dropped the last patch that removed split_kernel_mapping() to ensure
  that the splitting code is available for any radix guest running on
  platforms that don't set DRCONF_MEM_HOTREMOVABLE.


Aneesh Kumar K.V (2):
  powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
    mappings
  powerpc/mm/radix: Create separate mappings for hot-plugged memory

Bharata B Rao (2):
  powerpc/mm/radix: Free PUD table when freeing pagetable
  powerpc/mm/radix: Remove split_kernel_mapping()

 arch/powerpc/include/asm/book3s/64/pgalloc.h |  16 +-
 arch/powerpc/mm/book3s64/pgtable.c           |   5 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c     | 199 +++++++++++--------
 arch/powerpc/mm/pgtable-frag.c               |   3 +
 arch/powerpc/platforms/powernv/setup.c       |  10 +-
 5 files changed, 144 insertions(+), 89 deletions(-)

-- 
2.26.2



* [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
  2020-06-25  6:45 [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes Aneesh Kumar K.V
@ 2020-06-25  6:45 ` Aneesh Kumar K.V
  2020-06-25 11:30   ` Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao

We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

Fixing this includes 3 parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment count to 1.
- When freeing PUD, PMD and PTE page table pages, check explicitly
  if they come from memblock and if so free them appropriately.
- When we do early memblock based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as pagetable page.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparison of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 16 +++++++++++++++-
 arch/powerpc/mm/book3s64/pgtable.c           |  5 ++++-
 arch/powerpc/mm/book3s64/radix_pgtable.c     | 15 +++++++++++----
 arch/powerpc/mm/pgtable-frag.c               |  3 +++
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 69c5b051734f..e1af0b394ceb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -107,9 +107,23 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 	return pud;
 }
 
+static inline void __pud_free(pud_t *pud)
+{
+	struct page *page = virt_to_page(pud);
+
+	/*
+	 * Early pud pages allocated via memblock allocator
+	 * can't be directly freed to slab
+	 */
+	if (PageReserved(page))
+		free_reserved_page(page);
+	else
+		kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+}
+
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+	return __pud_free(pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index c58ad1049909..85de5b574dd7 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -339,6 +339,9 @@ void pmd_fragment_free(unsigned long *pmd)
 {
 	struct page *page = virt_to_page(pmd);
 
+	if (PageReserved(page))
+		return free_reserved_page(page);
+
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
@@ -356,7 +359,7 @@ static inline void pgtable_free(void *table, int index)
 		pmd_fragment_free(table);
 		break;
 	case PUD_INDEX:
-		kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), table);
+		__pud_free(table);
 		break;
 #if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE)
 		/* 16M hugepd directory at pud level */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 8acb96de0e48..80be96d3018f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -57,6 +57,13 @@ static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 	return ptr;
 }
 
+/*
+ * When allocating pud or pmd pointers, we allocate a complete page
+ * of PAGE_SIZE rather than PUD_TABLE_SIZE or PMD_TABLE_SIZE. This
+ * is to ensure that the page obtained from the memblock allocator
+ * can be completely used as page table page and can be freed
+ * correctly when the page table entries are removed.
+ */
 static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 			  pgprot_t flags,
 			  unsigned int map_page_size,
@@ -73,8 +80,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 	pgdp = pgd_offset_k(ea);
 	p4dp = p4d_offset(pgdp, ea);
 	if (p4d_none(*p4dp)) {
-		pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
-						region_start, region_end);
+		pudp = early_alloc_pgtable(PAGE_SIZE, nid,
+					   region_start, region_end);
 		p4d_populate(&init_mm, p4dp, pudp);
 	}
 	pudp = pud_offset(p4dp, ea);
@@ -83,8 +90,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 		goto set_the_pte;
 	}
 	if (pud_none(*pudp)) {
-		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
-						region_start, region_end);
+		pmdp = early_alloc_pgtable(PAGE_SIZE, nid, region_start,
+					   region_end);
 		pud_populate(&init_mm, pudp, pmdp);
 	}
 	pmdp = pmd_offset(pudp, ea);
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index ee4bd6d38602..97ae4935da79 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -110,6 +110,9 @@ void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
 
+	if (PageReserved(page))
+		return free_reserved_page(page);
+
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		if (!kernel)
-- 
2.26.2



* [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
  2020-06-25  6:45 [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings Aneesh Kumar K.V
@ 2020-06-25  6:45 ` Aneesh Kumar K.V
  2020-07-08  2:12   ` Michael Ellerman
  2020-07-08 20:30   ` Reza Arbab
  2020-06-25  6:45 ` [PATCH v2 3/4] powerpc/mm/radix: Remove split_kernel_mapping() Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory Aneesh Kumar K.V
  3 siblings, 2 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao

From: Bharata B Rao <bharata@linux.ibm.com>

remove_pagetable() isn't freeing PUD table. This causes memory
leak during memory unplug. Fix this.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 80be96d3018f..af57604f295f 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -708,6 +708,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
 	pud_clear(pud);
 }
 
+static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	pud_free(&init_mm, pud_start);
+	p4d_clear(p4d);
+}
+
 struct change_mapping_params {
 	pte_t *pte;
 	unsigned long start;
@@ -882,6 +897,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 
 		pud_base = (pud_t *)p4d_page_vaddr(*p4d);
 		remove_pud_table(pud_base, addr, next);
+		free_pud_table(pud_base, p4d);
 	}
 
 	spin_unlock(&init_mm.page_table_lock);
-- 
2.26.2



* [PATCH v2 3/4] powerpc/mm/radix: Remove split_kernel_mapping()
  2020-06-25  6:45 [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable Aneesh Kumar K.V
@ 2020-06-25  6:45 ` Aneesh Kumar K.V
  2020-06-25  6:45 ` [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory Aneesh Kumar K.V
  3 siblings, 0 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao

From: Bharata B Rao <bharata@linux.ibm.com>

We split the page table mapping on memory unplug if the
linear range was mapped with a huge page mapping (for example, 1G).
The page table splitting code has a few issues:

1. Recursive locking
--------------------
Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.

2. BUG: sleeping function called from in_atomic() context
---------------------------------------------------------
Memory unplug path (remove_pagetable) takes the init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion().

3. Bad unlock balance
---------------------
Memory unplug path takes the init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reacquire
the ptl. Releasing the ptl from a different thread than the one that
acquired it causes a bad unlock balance.

These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mappings, thereby removing the need to split them during
unplug. The kernel always makes sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.

In preparation for such a change remove page table splitting support.

This essentially is a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 95 +++++-------------------
 1 file changed, 19 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index af57604f295f..78ad11812e0d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,7 +15,6 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/string_helpers.h>
-#include <linux/stop_machine.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -723,32 +722,6 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 	p4d_clear(p4d);
 }
 
-struct change_mapping_params {
-	pte_t *pte;
-	unsigned long start;
-	unsigned long end;
-	unsigned long aligned_start;
-	unsigned long aligned_end;
-};
-
-static int __meminit stop_machine_change_mapping(void *data)
-{
-	struct change_mapping_params *params =
-			(struct change_mapping_params *)data;
-
-	if (!data)
-		return -1;
-
-	spin_unlock(&init_mm.page_table_lock);
-	pte_clear(&init_mm, params->aligned_start, params->pte);
-	create_physical_mapping(__pa(params->aligned_start),
-				__pa(params->start), -1, PAGE_KERNEL);
-	create_physical_mapping(__pa(params->end), __pa(params->aligned_end),
-				-1, PAGE_KERNEL);
-	spin_lock(&init_mm.page_table_lock);
-	return 0;
-}
-
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 			     unsigned long end)
 {
@@ -777,52 +750,6 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 	}
 }
 
-/*
- * clear the pte and potentially split the mapping helper
- */
-static void __meminit split_kernel_mapping(unsigned long addr, unsigned long end,
-				unsigned long size, pte_t *pte)
-{
-	unsigned long mask = ~(size - 1);
-	unsigned long aligned_start = addr & mask;
-	unsigned long aligned_end = addr + size;
-	struct change_mapping_params params;
-	bool split_region = false;
-
-	if ((end - addr) < size) {
-		/*
-		 * We're going to clear the PTE, but not flushed
-		 * the mapping, time to remap and flush. The
-		 * effects if visible outside the processor or
-		 * if we are running in code close to the
-		 * mapping we cleared, we are in trouble.
-		 */
-		if (overlaps_kernel_text(aligned_start, addr) ||
-			overlaps_kernel_text(end, aligned_end)) {
-			/*
-			 * Hack, just return, don't pte_clear
-			 */
-			WARN_ONCE(1, "Linear mapping %lx->%lx overlaps kernel "
-				  "text, not splitting\n", addr, end);
-			return;
-		}
-		split_region = true;
-	}
-
-	if (split_region) {
-		params.pte = pte;
-		params.start = addr;
-		params.end = end;
-		params.aligned_start = addr & ~(size - 1);
-		params.aligned_end = min_t(unsigned long, aligned_end,
-				(unsigned long)__va(memblock_end_of_DRAM()));
-		stop_machine(stop_machine_change_mapping, &params, NULL);
-		return;
-	}
-
-	pte_clear(&init_mm, addr, pte);
-}
-
 static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 			     unsigned long end)
 {
@@ -838,7 +765,12 @@ static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 			continue;
 
 		if (pmd_is_leaf(*pmd)) {
-			split_kernel_mapping(addr, end, PMD_SIZE, (pte_t *)pmd);
+			if (!IS_ALIGNED(addr, PMD_SIZE) ||
+			    !IS_ALIGNED(next, PMD_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+			pte_clear(&init_mm, addr, (pte_t *)pmd);
 			continue;
 		}
 
@@ -863,7 +795,12 @@ static void remove_pud_table(pud_t *pud_start, unsigned long addr,
 			continue;
 
 		if (pud_is_leaf(*pud)) {
-			split_kernel_mapping(addr, end, PUD_SIZE, (pte_t *)pud);
+			if (!IS_ALIGNED(addr, PUD_SIZE) ||
+			    !IS_ALIGNED(next, PUD_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+			pte_clear(&init_mm, addr, (pte_t *)pud);
 			continue;
 		}
 
@@ -891,7 +828,13 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
 			continue;
 
 		if (p4d_is_leaf(*p4d)) {
-			split_kernel_mapping(addr, end, P4D_SIZE, (pte_t *)p4d);
+			if (!IS_ALIGNED(addr, P4D_SIZE) ||
+			    !IS_ALIGNED(next, P4D_SIZE)) {
+				WARN_ONCE(1, "%s: unaligned range\n", __func__);
+				continue;
+			}
+
+			pte_clear(&init_mm, addr, (pte_t *)pgd);
 			continue;
 		}
 
-- 
2.26.2



* [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
  2020-06-25  6:45 [PATCH v2 0/4] powerpc/mm/radix: Memory unplug fixes Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2020-06-25  6:45 ` [PATCH v2 3/4] powerpc/mm/radix: Remove split_kernel_mapping() Aneesh Kumar K.V
@ 2020-06-25  6:45 ` Aneesh Kumar K.V
  2020-07-08  4:44   ` Michael Ellerman
  3 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25  6:45 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao

To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.

This implies on pseries system, we now end up mapping
memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range.  This was added that as part of

commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")

But we still don't do that on PowerVM. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 83 ++++++++++++++++++++----
 arch/powerpc/platforms/powernv/setup.c   | 10 ++-
 2 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 78ad11812e0d..a4179e4da49d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/string_helpers.h>
+#include <linux/memory.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -34,6 +35,7 @@
 
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
+unsigned int radix_mem_block_size;
 
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
 			unsigned long region_start, unsigned long region_end)
@@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
 
 static int __meminit create_physical_mapping(unsigned long start,
 					     unsigned long end,
+					     unsigned long max_mapping_size,
 					     int nid, pgprot_t _prot)
 {
 	unsigned long vaddr, addr, mapping_size = 0;
@@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
 		int rc;
 
 		gap = next_boundary(addr, end) - addr;
+		if (gap > max_mapping_size)
+			gap = max_mapping_size;
 		previous_size = mapping_size;
 		prev_exec = exec;
 
@@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)
 
 	/* We don't support slb for radix */
 	mmu_slb_size = 0;
+
 	/*
-	 * Create the linear mapping, using standard page size for now
+	 * Create the linear mapping
 	 */
 	for_each_memblock(memory, reg) {
 		/*
@@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)
 
 		WARN_ON(create_physical_mapping(reg->base,
 						reg->base + reg->size,
+						radix_mem_block_size,
 						-1, PAGE_KERNEL));
 	}
 
@@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
 	return 1;
 }
 
+static int __init probe_memory_block_size(unsigned long node, const char *uname, int
+					  depth, void *data)
+{
+	const __be32 *block_size;
+	int len;
+
+	if (depth != 1)
+		return 0;
+
+	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {
+
+		block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+		if (!block_size || len < dt_root_size_cells * sizeof(__be32))
+			/*
+			 * Nothing in the device tree
+			 */
+			return MIN_MEMORY_BLOCK_SIZE;
+
+		return dt_mem_next_cell(dt_root_size_cells, &block_size);
+
+	}
+
+	return 0;
+}
+
+static unsigned long radix_memory_block_size(void)
+{
+	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+
+	if (firmware_has_feature(FW_FEATURE_OPAL)) {
+
+		mem_block_size = 1UL * 1024 * 1024 * 1024;
+
+	} else if (firmware_has_feature(FW_FEATURE_LPAR)) {
+		mem_block_size = of_scan_flat_dt(probe_memory_block_size, NULL);
+		if (!mem_block_size)
+			mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return mem_block_size;
+}
+
+
 void __init radix__early_init_devtree(void)
 {
 	int rc;
@@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
 	 * Try to find the available page sizes in the device-tree
 	 */
 	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
-	if (rc != 0)  /* Found */
-		goto found;
+	if (rc == 0) {
+		/*
+		 * no page size details found in device tree
+		 * let's assume we have page 4k and 64k support
+		 */
+		mmu_psize_defs[MMU_PAGE_4K].shift = 12;
+		mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
+
+		mmu_psize_defs[MMU_PAGE_64K].shift = 16;
+		mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
+	}
+
 	/*
-	 * let's assume we have page 4k and 64k support
+	 * Max mapping size used when mapping pages. We don't use
+	 * ppc_md.memory_block_size() here because this get called
+	 * early and we don't have machine probe called yet. Also
+	 * the pseries implementation only check for ibm,lmb-size.
+	 * All hypervisor supporting radix do expose that device
+	 * tree node.
 	 */
-	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
-	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
-
-	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
-	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
-found:
+	radix_mem_block_size = radix_memory_block_size();
 	return;
 }
 
@@ -856,7 +916,8 @@ int __meminit radix__create_section_mapping(unsigned long start,
 		return -1;
 	}
 
-	return create_physical_mapping(__pa(start), __pa(end), nid, prot);
+	return create_physical_mapping(__pa(start), __pa(end),
+				       radix_mem_block_size, nid, prot);
 }
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 3bc188da82ba..6e2f614590a3 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -399,7 +399,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static unsigned long pnv_memory_block_size(void)
 {
-	return 256UL * 1024 * 1024;
+	/*
+	 * We map the kernel linear region with 1GB large pages on radix. For
+	 * memory hot unplug to work our memory block size must be at least
+	 * this size.
+	 */
+	if (radix_enabled())
+		return 1UL * 1024 * 1024 * 1024;
+	else
+		return 256UL * 1024 * 1024;
 }
 #endif
 
-- 
2.26.2



* Re: [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
  2020-06-25  6:45 ` [PATCH v2 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings Aneesh Kumar K.V
@ 2020-06-25 11:30   ` Aneesh Kumar K.V
  0 siblings, 0 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-06-25 11:30 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Bharata B Rao

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:


> Fixing this includes 3 parts:
>
> - Re-walk the init_mm page tables from mem_init() and initialize
>   the PMD and PTE fragment count to 1.
> - When freeing PUD, PMD and PTE page table pages, check explicitly
>   if they come from memblock and if so free them appropriately.
> - When we do early memblock based allocation of PMD and PUD pages,
>   allocate in PAGE_SIZE granularity so that we are sure the
>   complete page is used as pagetable page.
>
> Since we now do PAGE_SIZE allocations for both PUD table and
> PMD table (Note that PTE table allocation is already of PAGE_SIZE),
> we end up allocating more memory for the same amount of system RAM.
> Here is a comparison of how much more we need for a 64T and 2G
> system after this patch:
>

Missed updating the commit message w.r.t page table fragments.  Updated
one below.

powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparison of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

-aneesh


* Re: [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
  2020-06-25  6:45 ` [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable Aneesh Kumar K.V
@ 2020-07-08  2:12   ` Michael Ellerman
  2020-07-08  2:48     ` Aneesh Kumar K.V
  2020-07-08 20:30   ` Reza Arbab
  1 sibling, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2020-07-08  2:12 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev; +Cc: Aneesh Kumar K . V, Bharata B Rao

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> From: Bharata B Rao <bharata@linux.ibm.com>
>
> remove_pagetable() isn't freeing PUD table. This causes memory
> leak during memory unplug. Fix this.

Fixes: ??

cheers

> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 80be96d3018f..af57604f295f 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -708,6 +708,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>  	pud_clear(pud);
>  }
>  
> +static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (!pud_none(*pud))
> +			return;
> +	}
> +
> +	pud_free(&init_mm, pud_start);
> +	p4d_clear(p4d);
> +}
> +
>  struct change_mapping_params {
>  	pte_t *pte;
>  	unsigned long start;
> @@ -882,6 +897,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
>  
>  		pud_base = (pud_t *)p4d_page_vaddr(*p4d);
>  		remove_pud_table(pud_base, addr, next);
> +		free_pud_table(pud_base, p4d);
>  	}
>  
>  	spin_unlock(&init_mm.page_table_lock);
> -- 
> 2.26.2


* Re: [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
  2020-07-08  2:12   ` Michael Ellerman
@ 2020-07-08  2:48     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-07-08  2:48 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: Bharata B Rao

On 7/8/20 7:42 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> From: Bharata B Rao <bharata@linux.ibm.com>
>>
>> remove_pagetable() isn't freeing PUD table. This causes memory
>> leak during memory unplug. Fix this.
> 

Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")


> Fixes: ??
> 
> cheers
> 
>> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/mm/book3s64/radix_pgtable.c | 16 ++++++++++++++++
>>   1 file changed, 16 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index 80be96d3018f..af57604f295f 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -708,6 +708,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>>   	pud_clear(pud);
>>   }
>>   
>> +static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
>> +{
>> +	pud_t *pud;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (!pud_none(*pud))
>> +			return;
>> +	}
>> +
>> +	pud_free(&init_mm, pud_start);
>> +	p4d_clear(p4d);
>> +}
>> +
>>   struct change_mapping_params {
>>   	pte_t *pte;
>>   	unsigned long start;
>> @@ -882,6 +897,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
>>   
>>   		pud_base = (pud_t *)p4d_page_vaddr(*p4d);
>>   		remove_pud_table(pud_base, addr, next);
>> +		free_pud_table(pud_base, p4d);
>>   	}
>>   
>>   	spin_unlock(&init_mm.page_table_lock);
>> -- 
>> 2.26.2



* Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
  2020-06-25  6:45 ` [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory Aneesh Kumar K.V
@ 2020-07-08  4:44   ` Michael Ellerman
  2020-07-08  6:24     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2020-07-08  4:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev; +Cc: Aneesh Kumar K.V, Bharata B Rao

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> To enable memory unplug without splitting kernel page table
> mapping, we force the max mapping size to the LMB size. LMB
> size is the unit in which hypervisor will do memory add/remove
> operation.
>
> This implies on pseries system, we now end up mapping

Please expand on why it "implies" that for pseries.

> memory with 2M page size instead of 1G. To improve
> that we want hypervisor to hint the kernel about the hotplug
> memory range.  This was added that as part of
                 That
>
> commit b6eca183e23e ("powerpc/kernel: Enables memory
> hot-remove after reboot on pseries guests")
>
> But we still don't do that on PowerVM. Once we get PowerVM

I think you mean PowerVM doesn't provide that hint yet?

Realistically it won't until P10. So this means we'll always use 2MB on
Power9 PowerVM doesn't it?

What about KVM?

Have you done any benchmarking on the impact of switching the linear
mapping to 2MB pages?

> updated, we can then force the 2M mapping only to hot-pluggable
> memory region using memblock_is_hotpluggable(). Till then
> let's depend on LMB size for finding the mapping page size
> for linear range.
>
> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 83 ++++++++++++++++++++----
>  arch/powerpc/platforms/powernv/setup.c   | 10 ++-
>  2 files changed, 81 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 78ad11812e0d..a4179e4da49d 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -15,6 +15,7 @@
>  #include <linux/mm.h>
>  #include <linux/hugetlb.h>
>  #include <linux/string_helpers.h>
> +#include <linux/memory.h>
>  
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
> @@ -34,6 +35,7 @@
>  
>  unsigned int mmu_pid_bits;
>  unsigned int mmu_base_pid;
> +unsigned int radix_mem_block_size;

static? __ro_after_init?
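
i.e. something along the lines of the declaration below (untested, and
assuming nothing outside this file needs the symbol; if
pnv_memory_block_size() ends up reading it, as suggested further down,
it would have to stay non-static):

	static unsigned int radix_mem_block_size __ro_after_init;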

> @@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
>  
>  static int __meminit create_physical_mapping(unsigned long start,
>  					     unsigned long end,
> +					     unsigned long max_mapping_size,
>  					     int nid, pgprot_t _prot)
>  {
>  	unsigned long vaddr, addr, mapping_size = 0;
> @@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
>  		int rc;
>  
>  		gap = next_boundary(addr, end) - addr;
> +		if (gap > max_mapping_size)
> +			gap = max_mapping_size;
>  		previous_size = mapping_size;
>  		prev_exec = exec;
>  
> @@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)
>  
>  	/* We don't support slb for radix */
>  	mmu_slb_size = 0;
> +
>  	/*
> -	 * Create the linear mapping, using standard page size for now
> +	 * Create the linear mapping
>  	 */
>  	for_each_memblock(memory, reg) {
>  		/*
> @@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)
>  
>  		WARN_ON(create_physical_mapping(reg->base,
>  						reg->base + reg->size,
> +						radix_mem_block_size,
>  						-1, PAGE_KERNEL));
>  	}
>  
> @@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
>  	return 1;
>  }
>  
> +static int __init probe_memory_block_size(unsigned long node, const char *uname, int
> +					  depth, void *data)
> +{
> +	const __be32 *block_size;
> +	int len;
> +
> +	if (depth != 1)
> +		return 0;
> +
> +	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {

	if (strcmp(uname, "ibm,dynamic-reconfiguration-memory") != 0)
        	return 0;

Then you can de-indent the rest of the body.


> +		block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
> +		if (!block_size || len < dt_root_size_cells * sizeof(__be32))

This will give us a sparse warning, please do the endian conversion
before looking at the value.

> +			/*
> +			 * Nothing in the device tree
> +			 */
> +			return MIN_MEMORY_BLOCK_SIZE;

> +
> +		return dt_mem_next_cell(dt_root_size_cells, &block_size);

Just of_read_number() ?

This is misusing the return value, as I explained on one of your other
recent patches. You should return !0 to indicate that scanning should
stop, and the actual value can go via the data pointer, or better just
set the global.
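
Something along these lines is what I mean (untested sketch, folding in
the early return, the endian conversion and the data pointer):

static int __init probe_memory_block_size(unsigned long node, const char *uname,
					  int depth, void *data)
{
	unsigned long *mem_block_size = data;
	const __be32 *prop;
	int len;

	if (depth != 1)
		return 0;

	if (strcmp(uname, "ibm,dynamic-reconfiguration-memory") != 0)
		return 0;

	prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
	if (!prop || len < dt_root_size_cells * sizeof(__be32))
		/* Nothing usable in the device tree */
		*mem_block_size = MIN_MEMORY_BLOCK_SIZE;
	else
		*mem_block_size = of_read_number(prop, dt_root_size_cells);

	/* A non-zero return stops the scan */
	return 1;
}

with the caller doing something like:

	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;

	of_scan_flat_dt(probe_memory_block_size, &mem_block_size);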


> +	}
> +
> +	return 0;
> +}
> +
> +static unsigned long radix_memory_block_size(void)
> +{
> +	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;

> +
> +	if (firmware_has_feature(FW_FEATURE_OPAL)) {
> +
> +		mem_block_size = 1UL * 1024 * 1024 * 1024;

We have 1GB hardcoded here and also in pnv_memory_block_size().

Shouldn't pnv_memory_block_size() at least be using radix_mem_block_size?
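
i.e. something like this (untested), assuming radix_mem_block_size gets
a declaration the powernv code can see:

static unsigned long pnv_memory_block_size(void)
{
	if (radix_enabled())
		return radix_mem_block_size;

	return 256UL * 1024 * 1024;
}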

> +	} else if (firmware_has_feature(FW_FEATURE_LPAR)) {
> +		mem_block_size = of_scan_flat_dt(probe_memory_block_size, NULL);
> +		if (!mem_block_size)
> +			mem_block_size = MIN_MEMORY_BLOCK_SIZE;
> +	}
> +
> +	return mem_block_size;
> +}

It would probably be simpler if that was just inlined below.

> +
> +
>  void __init radix__early_init_devtree(void)
>  {
>  	int rc;
> @@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
>  	 * Try to find the available page sizes in the device-tree
>  	 */
>  	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
> -	if (rc != 0)  /* Found */
> -		goto found;
> +	if (rc == 0) {
> +		/*
> +		 * no page size details found in device tree
> +		 * let's assume we have page 4k and 64k support

Capitals and punctuation please?

> +		 */
> +		mmu_psize_defs[MMU_PAGE_4K].shift = 12;
> +		mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
> +
> +		mmu_psize_defs[MMU_PAGE_64K].shift = 16;
> +		mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
> +	}

Moving that seems like an unrelated change. It's a reasonable change but
I'd rather you did it in a standalone patch.

>  	/*
> -	 * let's assume we have page 4k and 64k support
> +	 * Max mapping size used when mapping pages. We don't use
> +	 * ppc_md.memory_block_size() here because this get called
> +	 * early and we don't have machine probe called yet. Also
> +	 * the pseries implementation only check for ibm,lmb-size.
> +	 * All hypervisor supporting radix do expose that device
> +	 * tree node.
>  	 */
> -	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
> -	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
> -
> -	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
> -	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
> -found:
> +	radix_mem_block_size = radix_memory_block_size();

If you did that earlier in the function, before
radix_dt_scan_page_sizes(), the logic would be simpler.
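
Roughly (untested sketch, using the helper added in this patch):

void __init radix__early_init_devtree(void)
{
	int rc;

	/*
	 * The memory block size doesn't depend on the page size scan,
	 * so work it out first.
	 */
	radix_mem_block_size = radix_memory_block_size();

	/*
	 * Try to find the available page sizes in the device-tree
	 */
	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
	if (rc != 0) /* Found */
		return;

	/*
	 * No page size details found in the device tree;
	 * let's assume we have page 4k and 64k support.
	 */
	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;

	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
}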

>  	return;
>  }
>  
> @@ -856,7 +916,8 @@ int __meminit radix__create_section_mapping(unsigned long start,
>  		return -1;
>  	}
>  
> -	return create_physical_mapping(__pa(start), __pa(end), nid, prot);
> +	return create_physical_mapping(__pa(start), __pa(end),
> +				       radix_mem_block_size, nid, prot);
>  }
>  
>  int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 3bc188da82ba..6e2f614590a3 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -399,7 +399,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
>  #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
>  static unsigned long pnv_memory_block_size(void)
>  {
> -	return 256UL * 1024 * 1024;
> +	/*
> +	 * We map the kernel linear region with 1GB large pages on radix. For
> +	 * memory hot unplug to work our memory block size must be at least
> +	 * this size.
> +	 */
> +	if (radix_enabled())
> +		return 1UL * 1024 * 1024 * 1024;
> +	else
> +		return 256UL * 1024 * 1024;
>  }
>  #endif
>  
> -- 
> 2.26.2


cheers


* Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
  2020-07-08  4:44   ` Michael Ellerman
@ 2020-07-08  6:24     ` Aneesh Kumar K.V
  2020-07-08 12:14       ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2020-07-08  6:24 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: Bharata B Rao

On 7/8/20 10:14 AM, Michael Ellerman wrote:
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>> To enable memory unplug without splitting kernel page table
>> mapping, we force the max mapping size to the LMB size. LMB
>> size is the unit in which hypervisor will do memory add/remove
>> operation.
>>
>> This implies on pseries system, we now end up mapping
> 
> Please expand on why it "implies" that for pseries.
> 
>> memory with 2M page size instead of 1G. To improve
>> that we want hypervisor to hint the kernel about the hotplug
>> memory range.  This was added that as part of
>                   That
>>
>> commit b6eca183e23e ("powerpc/kernel: Enables memory
>> hot-remove after reboot on pseries guests")
>>
>> But we still don't do that on PowerVM. Once we get PowerVM
> 
> I think you mean PowerVM doesn't provide that hint yet?
> 
> Realistically it won't until P10. So this means we'll always use 2MB on
> Power9 PowerVM doesn't it?
> 
> What about KVM?
> 
> Have you done any benchmarking on the impact of switching the linear
> mapping to 2MB pages?
> 

The TLB impact should be minimal because with a 256M LMB size partition 
scoped entries are still 2M and hence we end up with TLBs of 2M size.


>> updated, we can then force the 2M mapping only to hot-pluggable
>> memory region using memblock_is_hotpluggable(). Till then
>> let's depend on LMB size for finding the mapping page size
>> for linear range.
>>

updated


powerpc/mm/radix: Create separate mappings for hot-plugged memory

To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.

Pseries systems support a max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range.  That was added as part of

commit b6eca18 ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")

But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.

With this change KVM guest will also be doing linear mapping with
2M page size.



>> Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   arch/powerpc/mm/book3s64/radix_pgtable.c | 83 ++++++++++++++++++++----
>>   arch/powerpc/platforms/powernv/setup.c   | 10 ++-
>>   2 files changed, 81 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index 78ad11812e0d..a4179e4da49d 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -15,6 +15,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/hugetlb.h>
>>   #include <linux/string_helpers.h>
>> +#include <linux/memory.h>
>>   
>>   #include <asm/pgtable.h>
>>   #include <asm/pgalloc.h>
>> @@ -34,6 +35,7 @@
>>   
>>   unsigned int mmu_pid_bits;
>>   unsigned int mmu_base_pid;
>> +unsigned int radix_mem_block_size;
> 
> static? __ro_after_init?
> 

Added __ro_after_init

>> @@ -266,6 +268,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
>>   
>>   static int __meminit create_physical_mapping(unsigned long start,
>>   					     unsigned long end,
>> +					     unsigned long max_mapping_size,
>>   					     int nid, pgprot_t _prot)
>>   {
>>   	unsigned long vaddr, addr, mapping_size = 0;
>> @@ -279,6 +282,8 @@ static int __meminit create_physical_mapping(unsigned long start,
>>   		int rc;
>>   
>>   		gap = next_boundary(addr, end) - addr;
>> +		if (gap > max_mapping_size)
>> +			gap = max_mapping_size;
>>   		previous_size = mapping_size;
>>   		prev_exec = exec;
>>   
>> @@ -329,8 +334,9 @@ static void __init radix_init_pgtable(void)
>>   
>>   	/* We don't support slb for radix */
>>   	mmu_slb_size = 0;
>> +
>>   	/*
>> -	 * Create the linear mapping, using standard page size for now
>> +	 * Create the linear mapping
>>   	 */
>>   	for_each_memblock(memory, reg) {
>>   		/*
>> @@ -346,6 +352,7 @@ static void __init radix_init_pgtable(void)
>>   
>>   		WARN_ON(create_physical_mapping(reg->base,
>>   						reg->base + reg->size,
>> +						radix_mem_block_size,
>>   						-1, PAGE_KERNEL));
>>   	}
>>   
>> @@ -486,6 +493,49 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
>>   	return 1;
>>   }
>>   
>> +static int __init probe_memory_block_size(unsigned long node, const char *uname, int
>> +					  depth, void *data)
>> +{
>> +	const __be32 *block_size;
>> +	int len;
>> +
>> +	if (depth != 1)
>> +		return 0;
>> +
>> +	if (!strcmp(uname, "ibm,dynamic-reconfiguration-memory")) {
> 
> 	if (strcmp(uname, "ibm,dynamic-reconfiguration-memory") != 0)
>          	return 0;
> 
> Then you can de-indent the rest of the body.
> 

updated

> 
>> +		block_size = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
>> +		if (!block_size || len < dt_root_size_cells * sizeof(__be32))
> 
> This will give us a sparse warning, please do the endian conversion
> before looking at the value.
> 
>> +			/*
>> +			 * Nothing in the device tree
>> +			 */
>> +			return MIN_MEMORY_BLOCK_SIZE;
> 
>> +
>> +		return dt_mem_next_cell(dt_root_size_cells, &block_size);
> 
> Just of_read_number() ?
> 
> This is misusing the return value, as I explained on one of your other
> recent patches. You should return !0 to indicate that scanning should
> stop, and the actual value can go via the data pointer, or better just
> set the global.
> 


updated

> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static unsigned long radix_memory_block_size(void)
>> +{
>> +	unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
> 
>> +
>> +	if (firmware_has_feature(FW_FEATURE_OPAL)) {
>> +
>> +		mem_block_size = 1UL * 1024 * 1024 * 1024;
> 
> We have 1GB hardcoded here and also in pnv_memory_block_size().
> 
> Shouldn't pnv_memory_block_size() at least be using radix_mem_block_size?


updated

> 
>> +	} else if (firmware_has_feature(FW_FEATURE_LPAR)) {
>> +		mem_block_size = of_scan_flat_dt(probe_memory_block_size, NULL);
>> +		if (!mem_block_size)
>> +			mem_block_size = MIN_MEMORY_BLOCK_SIZE;
>> +	}
>> +
>> +	return mem_block_size;
>> +}
> 
> It would probably be simpler if that was just inlined below.
> 
>> +
>> +
>>   void __init radix__early_init_devtree(void)
>>   {
>>   	int rc;
>> @@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
>>   	 * Try to find the available page sizes in the device-tree
>>   	 */
>>   	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
>> -	if (rc != 0)  /* Found */
>> -		goto found;
>> +	if (rc == 0) {
>> +		/*
>> +		 * no page size details found in device tree
>> +		 * let's assume we have page 4k and 64k support
> 
> Capitals and punctuation please?
> 
>> +		 */
>> +		mmu_psize_defs[MMU_PAGE_4K].shift = 12;
>> +		mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
>> +
>> +		mmu_psize_defs[MMU_PAGE_64K].shift = 16;
>> +		mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
>> +	}
> 
> Moving that seems like an unrelated change. It's a reasonable change but
> I'd rather you did it in a standalone patch.
> 

we needed that change so that we can call radix_memory_block_size() for 
both found and !found case.

>>   	/*
>> -	 * let's assume we have page 4k and 64k support
>> +	 * Max mapping size used when mapping pages. We don't use
>> +	 * ppc_md.memory_block_size() here because this get called
>> +	 * early and we don't have machine probe called yet. Also
>> +	 * the pseries implementation only check for ibm,lmb-size.
>> +	 * All hypervisor supporting radix do expose that device
>> +	 * tree node.
>>   	 */
>> -	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
>> -	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
>> -
>> -	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
>> -	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
>> -found:
>> +	radix_mem_block_size = radix_memory_block_size();
> 
> If you did that earlier in the function, before
> radix_dt_scan_page_sizes(), the logic would be simpler.
> 
>>   	return;
>>   }
>>   
>> @@ -856,7 +916,8 @@ int __meminit radix__create_section_mapping(unsigned long start,
>>   		return -1;
>>   	}
>>   
>> -	return create_physical_mapping(__pa(start), __pa(end), nid, prot);
>> +	return create_physical_mapping(__pa(start), __pa(end),
>> +				       radix_mem_block_size, nid, prot);
>>   }
>>   
>>   int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
>> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
>> index 3bc188da82ba..6e2f614590a3 100644
>> --- a/arch/powerpc/platforms/powernv/setup.c
>> +++ b/arch/powerpc/platforms/powernv/setup.c
>> @@ -399,7 +399,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
>>   #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
>>   static unsigned long pnv_memory_block_size(void)
>>   {
>> -	return 256UL * 1024 * 1024;
>> +	/*
>> +	 * We map the kernel linear region with 1GB large pages on radix. For
>> +	 * memory hot unplug to work our memory block size must be at least
>> +	 * this size.
>> +	 */
>> +	if (radix_enabled())
>> +		return 1UL * 1024 * 1024 * 1024;
>> +	else
>> +		return 256UL * 1024 * 1024;
>>   }
>>   #endif
>>   
>> -- 
>> 2.26.2
> 
> 
> cheers
> 


-aneesh


* Re: [PATCH v2 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
  2020-07-08  6:24     ` Aneesh Kumar K.V
@ 2020-07-08 12:14       ` Michael Ellerman
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2020-07-08 12:14 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev; +Cc: Bharata B Rao

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 7/8/20 10:14 AM, Michael Ellerman wrote:
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>> To enable memory unplug without splitting kernel page table
>>> mapping, we force the max mapping size to the LMB size. LMB
>>> size is the unit in which hypervisor will do memory add/remove
>>> operation.
>>>
>>> This implies on pseries system, we now end up mapping
>> 
>> Please expand on why it "implies" that for pseries.
>> 
>>> memory with 2M page size instead of 1G. To improve
>>> that we want hypervisor to hint the kernel about the hotplug
>>> memory range.  This was added that as part of
>>                   That
>>>
>>> commit b6eca183e23e ("powerpc/kernel: Enables memory
>>> hot-remove after reboot on pseries guests")
>>>
>>> But we still don't do that on PowerVM. Once we get PowerVM
>> 
>> I think you mean PowerVM doesn't provide that hint yet?
>> 
>> Realistically it won't until P10. So this means we'll always use 2MB on
>> Power9 PowerVM doesn't it?
>> 
>> What about KVM?
>> 
>> Have you done any benchmarking on the impact of switching the linear
>> mapping to 2MB pages?
>> 
>
> The TLB impact should be minimal because with a 256M LMB size partition 
> scoped entries are still 2M and hence we end up with TLBs of 2M size.
>
>
>>> updated, we can then force the 2M mapping only to hot-pluggable
>>> memory region using memblock_is_hotpluggable(). Till then
>>> let's depend on LMB size for finding the mapping page size
>>> for linear range.
>>>
>
> updated
>
>
> powerpc/mm/radix: Create separate mappings for hot-plugged memory
>
> To enable memory unplug without splitting kernel page table
> mapping, we force the max mapping size to the LMB size. LMB
> size is the unit in which hypervisor will do memory add/remove
> operation.
>
> Pseries systems support a max LMB size of 256MB. Hence on pseries,
> we now end up mapping memory with 2M page size instead of 1G. To improve
> that we want hypervisor to hint the kernel about the hotplug
> memory range.  That was added as part of
>
> commit b6eca18 ("powerpc/kernel: Enables memory
> hot-remove after reboot on pseries guests")
>
> But PowerVM doesn't provide that hint yet. Once we get PowerVM
> updated, we can then force the 2M mapping only to hot-pluggable
> memory region using memblock_is_hotpluggable(). Till then
> let's depend on LMB size for finding the mapping page size
> for linear range.
>
> With this change KVM guest will also be doing linear mapping with
> 2M page size.

...
>>> @@ -494,17 +544,27 @@ void __init radix__early_init_devtree(void)
>>>   	 * Try to find the available page sizes in the device-tree
>>>   	 */
>>>   	rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
>>> -	if (rc != 0)  /* Found */
>>> -		goto found;
>>> +	if (rc == 0) {
>>> +		/*
>>> +		 * no page size details found in device tree
>>> +		 * let's assume we have page 4k and 64k support
>> 
>> Capitals and punctuation please?
>> 
>>> +		 */
>>> +		mmu_psize_defs[MMU_PAGE_4K].shift = 12;
>>> +		mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
>>> +
>>> +		mmu_psize_defs[MMU_PAGE_64K].shift = 16;
>>> +		mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
>>> +	}
>> 
>> Moving that seems like an unrelated change. It's a reasonable change but
>> I'd rather you did it in a standalone patch.
>> 
>
> we needed that change so that we can call radix_memory_block_size() for 
> both found and !found case.

But the found and !found cases converge at found:, which is where you
call it. So I don't understand.

But as I said below, it would be even simpler if you worked out the
memory block size first.

cheers

>>>   	/*
>>> -	 * let's assume we have page 4k and 64k support
>>> +	 * Max mapping size used when mapping pages. We don't use
>>> +	 * ppc_md.memory_block_size() here because this get called
>>> +	 * early and we don't have machine probe called yet. Also
>>> +	 * the pseries implementation only check for ibm,lmb-size.
>>> +	 * All hypervisor supporting radix do expose that device
>>> +	 * tree node.
>>>   	 */
>>> -	mmu_psize_defs[MMU_PAGE_4K].shift = 12;
>>> -	mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
>>> -
>>> -	mmu_psize_defs[MMU_PAGE_64K].shift = 16;
>>> -	mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
>>> -found:
>>> +	radix_mem_block_size = radix_memory_block_size();
>> 
>> If you did that earlier in the function, before
>> radix_dt_scan_page_sizes(), the logic would be simpler.
>> 
>>>   	return;
>>>   }


* Re: [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
  2020-06-25  6:45 ` [PATCH v2 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable Aneesh Kumar K.V
  2020-07-08  2:12   ` Michael Ellerman
@ 2020-07-08 20:30   ` Reza Arbab
  1 sibling, 0 replies; 12+ messages in thread
From: Reza Arbab @ 2020-07-08 20:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, Bharata B Rao

On Thu, Jun 25, 2020 at 12:15:45PM +0530, Aneesh Kumar K.V wrote:
>remove_pagetable() isn't freeing PUD table. This causes memory
>leak during memory unplug. Fix this.

This has come up before:
https://lore.kernel.org/linuxppc-dev/20190731061920.GA18807@in.ibm.com/

tl;dr, x86 intentionally does not free, and it wasn't quite clear if 
their motivation also applies to us. Probably not, but I thought it was 
worth mentioning again.

-- 
Reza Arbab

