* [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
@ 2018-02-13 15:08 Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized Nicholas Piggin
                   ` (14 more replies)
  0 siblings, 15 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

This series allows NUMA-aware allocations for various early data
structures when running with the radix MMU. Hash still has a bolted-SLB
limitation that prevents at least pacas and stacks from being allocated
node-local.

This revision fixes a number of bugs, gets pSeries working, and adds a
couple more cases where page tables can be allocated node-local.

Thanks,
Nick

Nicholas Piggin (14):
  powerpc/64s: do not allocate lppaca if we are not virtualized
  powerpc/64: Use array of paca pointers and allocate pacas individually
  powerpc/64s: allocate lppacas individually
  powerpc/64s: allocate slb_shadow structures individually
  mm: make memblock_alloc_base_nid non-static
  powerpc/mm/numa: move numa topology discovery earlier
  powerpc/64: move default SPR recording
  powerpc/setup: cpu_to_phys_id array
  powerpc/64: defer paca allocation until memory topology is discovered
  powerpc/64: allocate pacas per node
  powerpc/64: allocate per-cpu stacks node-local if possible
  powerpc: pass node id into create_section_mapping
  powerpc/64s/radix: split early page table mapping to its own function
  powerpc/64s/radix: allocate kernel page tables node-local if possible

 arch/powerpc/include/asm/book3s/64/hash.h    |   2 +-
 arch/powerpc/include/asm/book3s/64/radix.h   |   2 +-
 arch/powerpc/include/asm/kvm_ppc.h           |   8 +-
 arch/powerpc/include/asm/lppaca.h            |  26 +--
 arch/powerpc/include/asm/paca.h              |  16 +-
 arch/powerpc/include/asm/pmc.h               |  13 +-
 arch/powerpc/include/asm/setup.h             |   1 +
 arch/powerpc/include/asm/smp.h               |   5 +-
 arch/powerpc/include/asm/sparsemem.h         |   2 +-
 arch/powerpc/kernel/asm-offsets.c            |   5 +
 arch/powerpc/kernel/crash.c                  |   2 +-
 arch/powerpc/kernel/head_64.S                |  19 ++-
 arch/powerpc/kernel/machine_kexec_64.c       |  37 +++--
 arch/powerpc/kernel/paca.c                   | 238 ++++++++++++++-------------
 arch/powerpc/kernel/prom.c                   |  12 +-
 arch/powerpc/kernel/setup-common.c           |  30 +++-
 arch/powerpc/kernel/setup.h                  |   9 +-
 arch/powerpc/kernel/setup_64.c               |  80 ++++++---
 arch/powerpc/kernel/smp.c                    |  10 +-
 arch/powerpc/kernel/sysfs.c                  |  18 +-
 arch/powerpc/kvm/book3s_hv.c                 |  34 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c         |   2 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S      |   3 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S      |   3 +-
 arch/powerpc/mm/hash_utils_64.c              |   2 +-
 arch/powerpc/mm/mem.c                        |   9 +-
 arch/powerpc/mm/numa.c                       |  36 ++--
 arch/powerpc/mm/pgtable-book3s64.c           |   6 +-
 arch/powerpc/mm/pgtable-radix.c              | 203 ++++++++++++++++-------
 arch/powerpc/mm/tlb-radix.c                  |   2 +-
 arch/powerpc/platforms/85xx/smp.c            |   8 +-
 arch/powerpc/platforms/cell/smp.c            |   4 +-
 arch/powerpc/platforms/powernv/idle.c        |  13 +-
 arch/powerpc/platforms/powernv/setup.c       |   4 +-
 arch/powerpc/platforms/powernv/smp.c         |   2 +-
 arch/powerpc/platforms/powernv/subcore.c     |   2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   2 +-
 arch/powerpc/platforms/pseries/kexec.c       |   7 +-
 arch/powerpc/platforms/pseries/lpar.c        |   4 +-
 arch/powerpc/platforms/pseries/setup.c       |   2 +-
 arch/powerpc/platforms/pseries/smp.c         |   4 +-
 arch/powerpc/sysdev/xics/icp-native.c        |   2 +-
 arch/powerpc/xmon/xmon.c                     |   2 +-
 include/linux/memblock.h                     |   5 +-
 mm/memblock.c                                |   2 +-
 45 files changed, 543 insertions(+), 355 deletions(-)

-- 
2.16.1


* [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-31 14:03   ` [01/14] " Michael Ellerman
  2018-02-13 15:08 ` [PATCH 02/14] powerpc/64: Use array of paca pointers and allocate pacas individually Nicholas Piggin
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

The "lppaca" is a structure registered with the hypervisor. This
is unnecessary when running on non-virtualised platforms. One field
from the lppaca (pmcregs_in_use) is also used by the host, so move
the host part out into the paca (lppaca field is still updated in
guest mode).
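
In effect, ppc_set_pmu_inuse() becomes a runtime dispatch on whether we
are running under a hypervisor (a simplified sketch of the pmc.h hunk
below; the real code additionally compiles each branch out when the
corresponding config option is disabled):

        static inline void ppc_set_pmu_inuse(int inuse)
        {
                if (firmware_has_feature(FW_FEATURE_LPAR))
                        get_lppaca()->pmcregs_in_use = inuse;   /* guest: HV-visible */
                else
                        get_paca()->pmcregs_in_use = inuse;     /* host: paca-local */
        }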

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/paca.h         |  8 ++++++--
 arch/powerpc/include/asm/pmc.h          | 13 ++++++++++++-
 arch/powerpc/kernel/asm-offsets.c       |  5 +++++
 arch/powerpc/kernel/paca.c              | 16 +++++++++++++---
 arch/powerpc/kvm/book3s_hv_interrupts.S |  3 +--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  3 +--
 6 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index b62c31037cad..57fe8aa0c257 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -58,7 +58,7 @@ struct task_struct;
  * processor.
  */
 struct paca_struct {
-#ifdef CONFIG_PPC_BOOK3S
+#ifdef CONFIG_PPC_PSERIES
 	/*
 	 * Because hw_cpu_id, unlike other paca fields, is accessed
 	 * routinely from other CPUs (from the IRQ code), we stick to
@@ -67,7 +67,8 @@ struct paca_struct {
 	 */
 
 	struct lppaca *lppaca_ptr;	/* Pointer to LpPaca for PLIC */
-#endif /* CONFIG_PPC_BOOK3S */
+#endif /* CONFIG_PPC_PSERIES */
+
 	/*
 	 * MAGIC: the spinlock functions in arch/powerpc/lib/locks.c 
 	 * load lock_token and paca_index with a single lwz
@@ -160,10 +161,13 @@ struct paca_struct {
 	u64 saved_msr;			/* MSR saved here by enter_rtas */
 	u16 trap_save;			/* Used when bad stack is encountered */
 	u8 irq_soft_mask;		/* mask for irq soft masking */
 	u8 irq_happened;		/* irq happened while soft-disabled */
 	u8 io_sync;			/* writel() needs spin_unlock sync */
 	u8 irq_work_pending;		/* IRQ_WORK interrupt while soft-disable */
 	u8 nap_state_lost;		/* NV GPR values lost in power7_idle */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	u8 pmcregs_in_use;		/* pseries puts this in lppaca */
+#endif
 	u64 sprg_vdso;			/* Saved user-visible sprg */
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	u64 tm_scratch;                 /* TM scratch area for reclaim */
diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
index 5a9ede4962cb..7ac3586c38ab 100644
--- a/arch/powerpc/include/asm/pmc.h
+++ b/arch/powerpc/include/asm/pmc.h
@@ -31,10 +31,21 @@ void ppc_enable_pmcs(void);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 #include <asm/lppaca.h>
+#include <asm/firmware.h>
 
 static inline void ppc_set_pmu_inuse(int inuse)
 {
-	get_lppaca()->pmcregs_in_use = inuse;
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
+	if (firmware_has_feature(FW_FEATURE_LPAR)) {
+#ifdef CONFIG_PPC_PSERIES
+		get_lppaca()->pmcregs_in_use = inuse;
+#endif
+	} else {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+		get_paca()->pmcregs_in_use = inuse;
+#endif
+	}
+#endif
 }
 
 extern void power4_enable_pmcs(void);
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 88b84ac76b53..b9b52490acfd 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -221,12 +221,17 @@ int main(void)
 	OFFSET(PACA_EXMC, paca_struct, exmc);
 	OFFSET(PACA_EXSLB, paca_struct, exslb);
 	OFFSET(PACA_EXNMI, paca_struct, exnmi);
+#ifdef CONFIG_PPC_PSERIES
 	OFFSET(PACALPPACAPTR, paca_struct, lppaca_ptr);
+#endif
 	OFFSET(PACA_SLBSHADOWPTR, paca_struct, slb_shadow_ptr);
 	OFFSET(SLBSHADOW_STACKVSID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid);
 	OFFSET(SLBSHADOW_STACKESID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].esid);
 	OFFSET(SLBSHADOW_SAVEAREA, slb_shadow, save_area);
 	OFFSET(LPPACA_PMCINUSE, lppaca, pmcregs_in_use);
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	OFFSET(PACA_PMCINUSE, paca_struct, pmcregs_in_use);
+#endif
 	OFFSET(LPPACA_DTLIDX, lppaca, dtl_idx);
 	OFFSET(LPPACA_YIELDCOUNT, lppaca, yield_count);
 	OFFSET(PACA_DTL_RIDX, paca_struct, dtl_ridx);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 95ffedf14885..5900540e2ff8 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -20,7 +20,7 @@
 
 #include "setup.h"
 
-#ifdef CONFIG_PPC_BOOK3S
+#ifdef CONFIG_PPC_PSERIES
 
 /*
  * The structure which the hypervisor knows about - this structure
@@ -47,6 +47,9 @@ static long __initdata lppaca_size;
 
 static void __init allocate_lppacas(int nr_cpus, unsigned long limit)
 {
+	if (early_cpu_has_feature(CPU_FTR_HVMODE))
+		return;
+
 	if (nr_cpus <= NR_LPPACAS)
 		return;
 
@@ -60,6 +63,9 @@ static struct lppaca * __init new_lppaca(int cpu)
 {
 	struct lppaca *lp;
 
+	if (early_cpu_has_feature(CPU_FTR_HVMODE))
+		return NULL;
+
 	if (cpu < NR_LPPACAS)
 		return &lppaca[cpu];
 
@@ -73,6 +79,9 @@ static void __init free_lppacas(void)
 {
 	long new_size = 0, nr;
 
+	if (early_cpu_has_feature(CPU_FTR_HVMODE))
+		return;
+
 	if (!lppaca_size)
 		return;
 	nr = num_possible_cpus() - NR_LPPACAS;
@@ -157,9 +166,10 @@ EXPORT_SYMBOL(paca);
 
 void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 {
-#ifdef CONFIG_PPC_BOOK3S
+#ifdef CONFIG_PPC_PSERIES
 	new_paca->lppaca_ptr = new_lppaca(cpu);
-#else
+#endif
+#ifdef CONFIG_PPC_BOOK3E
 	new_paca->kernel_pgd = swapper_pg_dir;
 #endif
 	new_paca->lock_token = 0x8000;
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S b/arch/powerpc/kvm/book3s_hv_interrupts.S
index dc54373c8780..0e8493033288 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -79,8 +79,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	li	r5, 0
 	mtspr	SPRN_MMCRA, r5
 	isync
-	ld	r3, PACALPPACAPTR(r13)	/* is the host using the PMU? */
-	lbz	r5, LPPACA_PMCINUSE(r3)
+	lbz	r5, PACA_PMCINUSE(r13)	/* is the host using the PMU? */
 	cmpwi	r5, 0
 	beq	31f			/* skip if not */
 	mfspr	r5, SPRN_MMCR1
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 7886b313d135..bb2ed7d1b96a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -113,8 +113,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 	mtspr	SPRN_SPRG_VDSO_WRITE,r3
 
 	/* Reload the host's PMU registers */
-	ld	r3, PACALPPACAPTR(r13)	/* is the host using the PMU? */
-	lbz	r4, LPPACA_PMCINUSE(r3)
+	lbz	r4, PACA_PMCINUSE(r13) /* is the host using the PMU? */
 	cmpwi	r4, 0
 	beq	23f			/* skip if not */
 BEGIN_FTR_SECTION
-- 
2.16.1


* [PATCH 02/14] powerpc/64: Use array of paca pointers and allocate pacas individually
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 03/14] powerpc/64s: allocate lppacas individually Nicholas Piggin
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

Change the paca array into an array of pointers to pacas, and allocate
pacas individually.

This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would also be able to defer PACA allocation until later in
boot, rather than allocating all possible ones up-front and then
freeing the unused ones.

This adds a little overhead (one additional pointer indirection) to
cross-CPU paca references, but those aren't too common.
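
For illustration, a cross-CPU paca access changes like this (names as
in the diff below):

        /* before: one flat array of fixed-size entries */
        extern struct paca_struct *paca;
        paca[cpu].hw_cpu_id = phys;

        /* after: an array of pointers, each paca allocated separately */
        extern struct paca_struct **paca_ptrs;
        paca_ptrs[cpu]->hw_cpu_id = phys;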

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/kvm_ppc.h           |  8 ++--
 arch/powerpc/include/asm/lppaca.h            |  2 +-
 arch/powerpc/include/asm/paca.h              |  4 +-
 arch/powerpc/include/asm/smp.h               |  4 +-
 arch/powerpc/kernel/crash.c                  |  2 +-
 arch/powerpc/kernel/head_64.S                | 19 ++++----
 arch/powerpc/kernel/machine_kexec_64.c       | 22 ++++-----
 arch/powerpc/kernel/paca.c                   | 70 +++++++++++++++++++---------
 arch/powerpc/kernel/setup_64.c               | 23 ++++-----
 arch/powerpc/kernel/smp.c                    | 10 ++--
 arch/powerpc/kernel/sysfs.c                  |  2 +-
 arch/powerpc/kvm/book3s_hv.c                 | 31 ++++++------
 arch/powerpc/kvm/book3s_hv_builtin.c         |  2 +-
 arch/powerpc/mm/tlb-radix.c                  |  2 +-
 arch/powerpc/platforms/85xx/smp.c            |  8 ++--
 arch/powerpc/platforms/cell/smp.c            |  4 +-
 arch/powerpc/platforms/powernv/idle.c        | 13 +++---
 arch/powerpc/platforms/powernv/setup.c       |  4 +-
 arch/powerpc/platforms/powernv/smp.c         |  2 +-
 arch/powerpc/platforms/powernv/subcore.c     |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
 arch/powerpc/platforms/pseries/lpar.c        |  4 +-
 arch/powerpc/platforms/pseries/setup.c       |  2 +-
 arch/powerpc/platforms/pseries/smp.c         |  4 +-
 arch/powerpc/sysdev/xics/icp-native.c        |  2 +-
 arch/powerpc/xmon/xmon.c                     |  2 +-
 26 files changed, 143 insertions(+), 107 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 9db18287b5f4..8908481cdfd7 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -432,15 +432,15 @@ struct openpic;
 extern void kvm_cma_reserve(void) __init;
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
-	paca[cpu].kvm_hstate.xics_phys = (void __iomem *)addr;
+	paca_ptrs[cpu]->kvm_hstate.xics_phys = (void __iomem *)addr;
 }
 
 static inline void kvmppc_set_xive_tima(int cpu,
 					unsigned long phys_addr,
 					void __iomem *virt_addr)
 {
-	paca[cpu].kvm_hstate.xive_tima_phys = (void __iomem *)phys_addr;
-	paca[cpu].kvm_hstate.xive_tima_virt = virt_addr;
+	paca_ptrs[cpu]->kvm_hstate.xive_tima_phys = (void __iomem *)phys_addr;
+	paca_ptrs[cpu]->kvm_hstate.xive_tima_virt = virt_addr;
 }
 
 static inline u32 kvmppc_get_xics_latch(void)
@@ -454,7 +454,7 @@ static inline u32 kvmppc_get_xics_latch(void)
 
 static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
 {
-	paca[cpu].kvm_hstate.host_ipi = host_ipi;
+	paca_ptrs[cpu]->kvm_hstate.host_ipi = host_ipi;
 }
 
 static inline void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index d0a2a2f99564..6e4589eee2da 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -103,7 +103,7 @@ struct lppaca {
 
 extern struct lppaca lppaca[];
 
-#define lppaca_of(cpu)	(*paca[cpu].lppaca_ptr)
+#define lppaca_of(cpu)	(*paca_ptrs[cpu]->lppaca_ptr)
 
 /*
  * We are using a non architected field to determine if a partition is
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 57fe8aa0c257..f266b0a7be95 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -246,10 +246,10 @@ struct paca_struct {
 	void *rfi_flush_fallback_area;
 	u64 l1d_flush_size;
 #endif
-};
+} ____cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
-extern struct paca_struct *paca;
+extern struct paca_struct **paca_ptrs;
 extern void initialise_paca(struct paca_struct *new_paca, int cpu);
 extern void setup_paca(struct paca_struct *new_paca);
 extern void allocate_pacas(void);
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index fac963e10d39..ec7b299350d9 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -170,12 +170,12 @@ static inline const struct cpumask *cpu_sibling_mask(int cpu)
 #ifdef CONFIG_PPC64
 static inline int get_hard_smp_processor_id(int cpu)
 {
-	return paca[cpu].hw_cpu_id;
+	return paca_ptrs[cpu]->hw_cpu_id;
 }
 
 static inline void set_hard_smp_processor_id(int cpu, int phys)
 {
-	paca[cpu].hw_cpu_id = phys;
+	paca_ptrs[cpu]->hw_cpu_id = phys;
 }
 #else
 /* 32-bit */
diff --git a/arch/powerpc/kernel/crash.c b/arch/powerpc/kernel/crash.c
index 00b215125d3e..17c8b99680f2 100644
--- a/arch/powerpc/kernel/crash.c
+++ b/arch/powerpc/kernel/crash.c
@@ -238,7 +238,7 @@ static void __maybe_unused crash_kexec_wait_realmode(int cpu)
 		if (i == cpu)
 			continue;
 
-		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
+		while (paca_ptrs[i]->kexec_state < KEXEC_STATE_REAL_MODE) {
 			barrier();
 			if (!cpu_possible(i) || !cpu_online(i) || (msecs <= 0))
 				break;
diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index a61151a6ea5e..6eca15f25c73 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -392,19 +392,20 @@ generic_secondary_common_init:
 	 * physical cpu id in r24, we need to search the pacas to find
 	 * which logical id maps to our physical one.
 	 */
-	LOAD_REG_ADDR(r13, paca)	/* Load paca pointer		 */
-	ld	r13,0(r13)		/* Get base vaddr of paca array	 */
 #ifndef CONFIG_SMP
-	addi	r13,r13,PACA_SIZE	/* know r13 if used accidentally */
 	b	kexec_wait		/* wait for next kernel if !SMP	 */
 #else
+	LOAD_REG_ADDR(r8, paca_ptrs)	/* Load paca_ptrs pointer	 */
+	ld	r8,0(r8)		/* Get base vaddr of array	 */
 	LOAD_REG_ADDR(r7, nr_cpu_ids)	/* Load nr_cpu_ids address       */
 	lwz	r7,0(r7)		/* also the max paca allocated 	 */
 	li	r5,0			/* logical cpu id                */
-1:	lhz	r6,PACAHWCPUID(r13)	/* Load HW procid from paca      */
+1:
+	sldi	r9,r5,3			/* get paca_ptrs[] index from cpu id */
+	ldx	r13,r9,r8		/* r13 = paca_ptrs[cpu id]       */
+	lhz	r6,PACAHWCPUID(r13)	/* Load HW procid from paca      */
 	cmpw	r6,r24			/* Compare to our id             */
 	beq	2f
-	addi	r13,r13,PACA_SIZE	/* Loop to next PACA on miss     */
 	addi	r5,r5,1
 	cmpw	r5,r7			/* Check if more pacas exist     */
 	blt	1b
@@ -756,10 +757,10 @@ _GLOBAL(pmac_secondary_start)
 	mtmsrd	r3			/* RI on */
 
 	/* Set up a paca value for this processor. */
-	LOAD_REG_ADDR(r4,paca)		/* Load paca pointer		*/
-	ld	r4,0(r4)		/* Get base vaddr of paca array	*/
-	mulli	r13,r24,PACA_SIZE	/* Calculate vaddr of right paca */
-	add	r13,r13,r4		/* for this processor.		*/
+	LOAD_REG_ADDR(r4,paca_ptrs)	/* Load paca pointer		*/
+	ld	r4,0(r4)		/* Get base vaddr of paca_ptrs array */
+	sldi	r5,r24,3		/* get paca_ptrs[] index from cpu id */
+	ldx	r13,r5,r4		/* r13 = paca_ptrs[cpu id]       */
 	SET_PACA(r13)			/* Save vaddr of paca in an SPRG*/
 
 	/* Mark interrupts soft and hard disabled (they might be enabled
diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index 49d34d7271e7..a250e3331f94 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -168,24 +168,25 @@ static void kexec_prepare_cpus_wait(int wait_state)
 	 * are correctly onlined.  If somehow we start a CPU on boot with RTAS
 	 * start-cpu, but somehow that CPU doesn't write callin_cpu_map[] in
 	 * time, the boot CPU will timeout.  If it does eventually execute
-	 * stuff, the secondary will start up (paca[].cpu_start was written) and
-	 * get into a peculiar state.  If the platform supports
-	 * smp_ops->take_timebase(), the secondary CPU will probably be spinning
-	 * in there.  If not (i.e. pseries), the secondary will continue on and
-	 * try to online itself/idle/etc. If it survives that, we need to find
-	 * these possible-but-not-online-but-should-be CPUs and chaperone them
-	 * into kexec_smp_wait().
+	 * stuff, the secondary will start up (paca_ptrs[]->cpu_start was
+	 * written) and get into a peculiar state.
+	 * If the platform supports smp_ops->take_timebase(), the secondary CPU
+	 * will probably be spinning in there.  If not (i.e. pseries), the
+	 * secondary will continue on and try to online itself/idle/etc. If it
+	 * survives that, we need to find these
+	 * possible-but-not-online-but-should-be CPUs and chaperone them into
+	 * kexec_smp_wait().
 	 */
 	for_each_online_cpu(i) {
 		if (i == my_cpu)
 			continue;
 
-		while (paca[i].kexec_state < wait_state) {
+		while (paca_ptrs[i]->kexec_state < wait_state) {
 			barrier();
 			if (i != notified) {
 				printk(KERN_INFO "kexec: waiting for cpu %d "
 				       "(physical %d) to enter %i state\n",
-				       i, paca[i].hw_cpu_id, wait_state);
+				       i, paca_ptrs[i]->hw_cpu_id, wait_state);
 				notified = i;
 			}
 		}
@@ -327,8 +328,7 @@ void default_machine_kexec(struct kimage *image)
 	 */
 	memcpy(&kexec_paca, get_paca(), sizeof(struct paca_struct));
 	kexec_paca.data_offset = 0xedeaddeadeeeeeeeUL;
-	paca = (struct paca_struct *)RELOC_HIDE(&kexec_paca, 0) -
-		kexec_paca.paca_index;
+	paca_ptrs[kexec_paca.paca_index] = &kexec_paca;
 	setup_paca(&kexec_paca);
 
 	/* XXX: If anyone does 'dynamic lppacas' this will also need to be
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 5900540e2ff8..eef4891c9af6 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -161,8 +161,8 @@ static void __init allocate_slb_shadows(int nr_cpus, int limit) { }
  * processors.  The processor VPD array needs one entry per physical
  * processor (not thread).
  */
-struct paca_struct *paca;
-EXPORT_SYMBOL(paca);
+struct paca_struct **paca_ptrs __read_mostly;
+EXPORT_SYMBOL(paca_ptrs);
 
 void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 {
@@ -213,11 +213,13 @@ void setup_paca(struct paca_struct *new_paca)
 
 }
 
-static int __initdata paca_size;
+static int __initdata paca_nr_cpu_ids;
+static int __initdata paca_ptrs_size;
 
 void __init allocate_pacas(void)
 {
 	u64 limit;
+	unsigned long size = 0;
 	int cpu;
 
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -230,13 +232,27 @@ void __init allocate_pacas(void)
 	limit = ppc64_rma_size;
 #endif
 
-	paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids);
+	paca_nr_cpu_ids = nr_cpu_ids;
 
-	paca = __va(memblock_alloc_base(paca_size, PAGE_SIZE, limit));
-	memset(paca, 0, paca_size);
+	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_ptrs = __va(memblock_alloc_base(paca_ptrs_size, 0, limit));
+	memset(paca_ptrs, 0, paca_ptrs_size);
 
-	printk(KERN_DEBUG "Allocated %u bytes for %u pacas at %p\n",
-		paca_size, nr_cpu_ids, paca);
+	size += paca_ptrs_size;
+
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
+		unsigned long pa;
+
+		pa = memblock_alloc_base(sizeof(struct paca_struct),
+						L1_CACHE_BYTES, limit);
+		paca_ptrs[cpu] = __va(pa);
+		memset(paca_ptrs[cpu], 0, sizeof(struct paca_struct));
+
+		size += sizeof(struct paca_struct);
+	}
+
+	printk(KERN_DEBUG "Allocated %lu bytes for %u pacas\n",
+			size, nr_cpu_ids);
 
 	allocate_lppacas(nr_cpu_ids, limit);
 
@@ -244,26 +260,38 @@ void __init allocate_pacas(void)
 
 	/* Can't use for_each_*_cpu, as they aren't functional yet */
 	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
-		initialise_paca(&paca[cpu], cpu);
+		initialise_paca(paca_ptrs[cpu], cpu);
 }
 
 void __init free_unused_pacas(void)
 {
-	int new_size;
-
-	new_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids);
-
-	if (new_size >= paca_size)
-		return;
-
-	memblock_free(__pa(paca) + new_size, paca_size - new_size);
-
-	printk(KERN_DEBUG "Freed %u bytes for unused pacas\n",
-		paca_size - new_size);
+	unsigned long size = 0;
+	int new_ptrs_size;
+	int cpu;
 
-	paca_size = new_size;
+	for (cpu = 0; cpu < paca_nr_cpu_ids; cpu++) {
+		if (!cpu_possible(cpu)) {
+			unsigned long pa = __pa(paca_ptrs[cpu]);
+			memblock_free(pa, sizeof(struct paca_struct));
+			paca_ptrs[cpu] = NULL;
+			size += sizeof(struct paca_struct);
+		}
+	}
+
+	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	if (new_ptrs_size < paca_ptrs_size) {
+		memblock_free(__pa(paca_ptrs) + new_ptrs_size,
+					paca_ptrs_size - new_ptrs_size);
+		size += paca_ptrs_size - new_ptrs_size;
+	}
+
+	if (size)
+		printk(KERN_DEBUG "Freed %lu bytes for unused pacas\n", size);
 
 	free_lppacas();
+
+	paca_nr_cpu_ids = nr_cpu_ids;
+	paca_ptrs_size = new_ptrs_size;
 }
 
 void copy_mm_to_paca(struct mm_struct *mm)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c388cc3357fa..3ce12af4906f 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -110,7 +110,7 @@ void __init setup_tlb_core_data(void)
 		if (cpu_first_thread_sibling(boot_cpuid) == first)
 			first = boot_cpuid;
 
-		paca[cpu].tcd_ptr = &paca[first].tcd;
+		paca_ptrs[cpu]->tcd_ptr = &paca_ptrs[first]->tcd;
 
 		/*
 		 * If we have threads, we need either tlbsrx.
@@ -304,7 +304,7 @@ void __init early_setup(unsigned long dt_ptr)
 	early_init_devtree(__va(dt_ptr));
 
 	/* Now we know the logical id of our boot cpu, setup the paca. */
-	setup_paca(&paca[boot_cpuid]);
+	setup_paca(paca_ptrs[boot_cpuid]);
 	fixup_boot_paca();
 
 	/*
@@ -628,15 +628,15 @@ void __init exc_lvl_early_init(void)
 	for_each_possible_cpu(i) {
 		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
 		critirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca[i].crit_kstack = __va(sp + THREAD_SIZE);
+		paca_ptrs[i]->crit_kstack = __va(sp + THREAD_SIZE);
 
 		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
 		dbgirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca[i].dbg_kstack = __va(sp + THREAD_SIZE);
+		paca_ptrs[i]->dbg_kstack = __va(sp + THREAD_SIZE);
 
 		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
 		mcheckirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca[i].mc_kstack = __va(sp + THREAD_SIZE);
+		paca_ptrs[i]->mc_kstack = __va(sp + THREAD_SIZE);
 	}
 
 	if (cpu_has_feature(CPU_FTR_DEBUG_LVL_EXC))
@@ -693,20 +693,20 @@ void __init emergency_stack_init(void)
 		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
-		paca[i].emergency_sp = (void *)ti + THREAD_SIZE;
+		paca_ptrs[i]->emergency_sp = (void *)ti + THREAD_SIZE;
 
 #ifdef CONFIG_PPC_BOOK3S_64
 		/* emergency stack for NMI exception handling. */
 		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
-		paca[i].nmi_emergency_sp = (void *)ti + THREAD_SIZE;
+		paca_ptrs[i]->nmi_emergency_sp = (void *)ti + THREAD_SIZE;
 
 		/* emergency stack for machine check exception handling. */
 		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
-		paca[i].mc_emergency_sp = (void *)ti + THREAD_SIZE;
+		paca_ptrs[i]->mc_emergency_sp = (void *)ti + THREAD_SIZE;
 #endif
 	}
 }
@@ -762,7 +762,7 @@ void __init setup_per_cpu_areas(void)
 	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
 	for_each_possible_cpu(cpu) {
                 __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
-		paca[cpu].data_offset = __per_cpu_offset[cpu];
+		paca_ptrs[cpu]->data_offset = __per_cpu_offset[cpu];
 	}
 }
 #endif
@@ -875,8 +875,9 @@ static void init_fallback_flush(void)
 	memset(l1d_flush_fallback_area, 0, l1d_size * 2);
 
 	for_each_possible_cpu(cpu) {
-		paca[cpu].rfi_flush_fallback_area = l1d_flush_fallback_area;
-		paca[cpu].l1d_flush_size = l1d_size;
+		struct paca_struct *paca = paca_ptrs[cpu];
+		paca->rfi_flush_fallback_area = l1d_flush_fallback_area;
+		paca->l1d_flush_size = l1d_size;
 	}
 }
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bbe7634b3a43..cfc08b099c49 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -123,8 +123,8 @@ int smp_generic_kick_cpu(int nr)
 	 * cpu_start field to become non-zero After we set cpu_start,
 	 * the processor will continue on to secondary_start
 	 */
-	if (!paca[nr].cpu_start) {
-		paca[nr].cpu_start = 1;
+	if (!paca_ptrs[nr]->cpu_start) {
+		paca_ptrs[nr]->cpu_start = 1;
 		smp_mb();
 		return 0;
 	}
@@ -657,7 +657,7 @@ void smp_prepare_boot_cpu(void)
 {
 	BUG_ON(smp_processor_id() != boot_cpuid);
 #ifdef CONFIG_PPC64
-	paca[boot_cpuid].__current = current;
+	paca_ptrs[boot_cpuid]->__current = current;
 #endif
 	set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
 	current_set[boot_cpuid] = task_thread_info(current);
@@ -748,8 +748,8 @@ static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
 	struct thread_info *ti = task_thread_info(idle);
 
 #ifdef CONFIG_PPC64
-	paca[cpu].__current = idle;
-	paca[cpu].kstack = (unsigned long)ti + THREAD_SIZE - STACK_FRAME_OVERHEAD;
+	paca_ptrs[cpu]->__current = idle;
+	paca_ptrs[cpu]->kstack = (unsigned long)ti + THREAD_SIZE - STACK_FRAME_OVERHEAD;
 #endif
 	ti->cpu = cpu;
 	secondary_ti = current_set[cpu] = ti;
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 5a8bfee6e187..1f9d94dac3a6 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -600,7 +600,7 @@ void __init record_spr_defaults(void)
 	if (cpu_has_feature(CPU_FTR_DSCR)) {
 		dscr_default = mfspr(SPRN_DSCR);
 		for (cpu = 0; cpu < nr_cpu_ids; cpu++)
-			paca[cpu].dscr_default = dscr_default;
+			paca_ptrs[cpu]->dscr_default = dscr_default;
 	}
 }
 #endif /* CONFIG_PPC64 */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e4f70c33fbc7..d340bda12067 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -167,7 +167,7 @@ static bool kvmppc_ipi_thread(int cpu)
 
 #if defined(CONFIG_PPC_ICP_NATIVE) && defined(CONFIG_SMP)
 	if (cpu >= 0 && cpu < nr_cpu_ids) {
-		if (paca[cpu].kvm_hstate.xics_phys) {
+		if (paca_ptrs[cpu]->kvm_hstate.xics_phys) {
 			xics_wake_cpu(cpu);
 			return true;
 		}
@@ -2122,7 +2122,7 @@ static int kvmppc_grab_hwthread(int cpu)
 	struct paca_struct *tpaca;
 	long timeout = 10000;
 
-	tpaca = &paca[cpu];
+	tpaca = paca_ptrs[cpu];
 
 	/* Ensure the thread won't go into the kernel if it wakes */
 	tpaca->kvm_hstate.kvm_vcpu = NULL;
@@ -2155,7 +2155,7 @@ static void kvmppc_release_hwthread(int cpu)
 {
 	struct paca_struct *tpaca;
 
-	tpaca = &paca[cpu];
+	tpaca = paca_ptrs[cpu];
 	tpaca->kvm_hstate.hwthread_req = 0;
 	tpaca->kvm_hstate.kvm_vcpu = NULL;
 	tpaca->kvm_hstate.kvm_vcore = NULL;
@@ -2221,7 +2221,7 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, struct kvmppc_vcore *vc)
 		vcpu->arch.thread_cpu = cpu;
 		cpumask_set_cpu(cpu, &kvm->arch.cpu_in_guest);
 	}
-	tpaca = &paca[cpu];
+	tpaca = paca_ptrs[cpu];
 	tpaca->kvm_hstate.kvm_vcpu = vcpu;
 	tpaca->kvm_hstate.ptid = cpu - vc->pcpu;
 	/* Order stores to hstate.kvm_vcpu etc. before store to kvm_vcore */
@@ -2246,7 +2246,7 @@ static void kvmppc_wait_for_nap(int n_threads)
 		 * for any threads that still have a non-NULL vcore ptr.
 		 */
 		for (i = 1; i < n_threads; ++i)
-			if (paca[cpu + i].kvm_hstate.kvm_vcore)
+			if (paca_ptrs[cpu + i]->kvm_hstate.kvm_vcore)
 				break;
 		if (i == n_threads) {
 			HMT_medium();
@@ -2256,7 +2256,7 @@ static void kvmppc_wait_for_nap(int n_threads)
 	}
 	HMT_medium();
 	for (i = 1; i < n_threads; ++i)
-		if (paca[cpu + i].kvm_hstate.kvm_vcore)
+		if (paca_ptrs[cpu + i]->kvm_hstate.kvm_vcore)
 			pr_err("KVM: CPU %d seems to be stuck\n", cpu + i);
 }
 
@@ -2786,9 +2786,11 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 	}
 
 	for (thr = 0; thr < controlled_threads; ++thr) {
-		paca[pcpu + thr].kvm_hstate.tid = thr;
-		paca[pcpu + thr].kvm_hstate.napping = 0;
-		paca[pcpu + thr].kvm_hstate.kvm_split_mode = sip;
+		struct paca_struct *paca = paca_ptrs[pcpu + thr];
+
+		paca->kvm_hstate.tid = thr;
+		paca->kvm_hstate.napping = 0;
+		paca->kvm_hstate.kvm_split_mode = sip;
 	}
 
 	/* Initiate micro-threading (split-core) on POWER8 if required */
@@ -2906,7 +2908,9 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
 	} else if (hpt_on_radix) {
 		/* Wait for all threads to have seen final sync */
 		for (thr = 1; thr < controlled_threads; ++thr) {
-			while (paca[pcpu + thr].kvm_hstate.kvm_split_mode) {
+			struct paca_struct *paca = paca_ptrs[pcpu + thr];
+
+			while (paca->kvm_hstate.kvm_split_mode) {
 				HMT_low();
 				barrier();
 			}
@@ -4376,7 +4380,7 @@ static int kvm_init_subcore_bitmap(void)
 		int node = cpu_to_node(first_cpu);
 
 		/* Ignore if it is already allocated. */
-		if (paca[first_cpu].sibling_subcore_state)
+		if (paca_ptrs[first_cpu]->sibling_subcore_state)
 			continue;
 
 		sibling_subcore_state =
@@ -4391,7 +4395,8 @@ static int kvm_init_subcore_bitmap(void)
 		for (j = 0; j < threads_per_core; j++) {
 			int cpu = first_cpu + j;
 
-			paca[cpu].sibling_subcore_state = sibling_subcore_state;
+			paca_ptrs[cpu]->sibling_subcore_state =
+						sibling_subcore_state;
 		}
 	}
 	return 0;
@@ -4418,7 +4423,7 @@ static int kvmppc_book3s_init_hv(void)
 
 	/*
 	 * We need a way of accessing the XICS interrupt controller,
-	 * either directly, via paca[cpu].kvm_hstate.xics_phys, or
+	 * either directly, via paca_ptrs[cpu]->kvm_hstate.xics_phys, or
 	 * indirectly, via OPAL.
 	 */
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index 49a2c7825e04..de18299f92b7 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -251,7 +251,7 @@ void kvmhv_rm_send_ipi(int cpu)
 	    return;
 
 	/* Else poke the target with an IPI */
-	xics_phys = paca[cpu].kvm_hstate.xics_phys;
+	xics_phys = paca_ptrs[cpu]->kvm_hstate.xics_phys;
 	if (xics_phys)
 		__raw_rm_writeb(IPI_PRIORITY, xics_phys + XICS_MFRR);
 	else
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 71d1b19ad1c0..e6016f4466f3 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -723,7 +723,7 @@ extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
 		for (; sib <= cpu_last_thread_sibling(cpu) && !flush; sib++) {
 			if (sib == cpu)
 				continue;
-			if (paca[sib].kvm_hstate.kvm_vcpu)
+			if (paca_ptrs[sib]->kvm_hstate.kvm_vcpu)
 				flush = true;
 		}
 		if (flush)
diff --git a/arch/powerpc/platforms/85xx/smp.c b/arch/powerpc/platforms/85xx/smp.c
index f51fd35f4618..7e966f4cf19a 100644
--- a/arch/powerpc/platforms/85xx/smp.c
+++ b/arch/powerpc/platforms/85xx/smp.c
@@ -147,7 +147,7 @@ static void qoriq_cpu_kill(unsigned int cpu)
 	for (i = 0; i < 500; i++) {
 		if (is_cpu_dead(cpu)) {
 #ifdef CONFIG_PPC64
-			paca[cpu].cpu_start = 0;
+			paca_ptrs[cpu]->cpu_start = 0;
 #endif
 			return;
 		}
@@ -328,7 +328,7 @@ static int smp_85xx_kick_cpu(int nr)
 		return ret;
 
 done:
-	paca[nr].cpu_start = 1;
+	paca_ptrs[nr]->cpu_start = 1;
 	generic_set_cpu_up(nr);
 
 	return ret;
@@ -409,14 +409,14 @@ void mpc85xx_smp_kexec_cpu_down(int crash_shutdown, int secondary)
 	}
 
 	if (disable_threadbit) {
-		while (paca[disable_cpu].kexec_state < KEXEC_STATE_REAL_MODE) {
+		while (paca_ptrs[disable_cpu]->kexec_state < KEXEC_STATE_REAL_MODE) {
 			barrier();
 			now = mftb();
 			if (!notified && now - start > 1000000) {
 				pr_info("%s/%d: waiting for cpu %d to enter KEXEC_STATE_REAL_MODE (%d)\n",
 					__func__, smp_processor_id(),
 					disable_cpu,
-					paca[disable_cpu].kexec_state);
+					paca_ptrs[disable_cpu]->kexec_state);
 				notified = true;
 			}
 		}
diff --git a/arch/powerpc/platforms/cell/smp.c b/arch/powerpc/platforms/cell/smp.c
index f84d52a2db40..1aeac5761e0b 100644
--- a/arch/powerpc/platforms/cell/smp.c
+++ b/arch/powerpc/platforms/cell/smp.c
@@ -83,7 +83,7 @@ static inline int smp_startup_cpu(unsigned int lcpu)
 	pcpu = get_hard_smp_processor_id(lcpu);
 
 	/* Fixup atomic count: it exited inside IRQ handler. */
-	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
+	task_thread_info(paca_ptrs[lcpu]->__current)->preempt_count	= 0;
 
 	/*
 	 * If the RTAS start-cpu token does not exist then presume the
@@ -126,7 +126,7 @@ static int smp_cell_kick_cpu(int nr)
 	 * cpu_start field to become non-zero After we set cpu_start,
 	 * the processor will continue on to secondary_start
 	 */
-	paca[nr].cpu_start = 1;
+	paca_ptrs[nr]->cpu_start = 1;
 
 	return 0;
 }
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 443d5ca71995..5b2ca71ee551 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -80,7 +80,7 @@ static int pnv_save_sprs_for_deep_states(void)
 
 	for_each_possible_cpu(cpu) {
 		uint64_t pir = get_hard_smp_processor_id(cpu);
-		uint64_t hsprg0_val = (uint64_t)&paca[cpu];
+		uint64_t hsprg0_val = (uint64_t)paca_ptrs[cpu];
 
 		rc = opal_slw_set_reg(pir, SPRN_HSPRG0, hsprg0_val);
 		if (rc != 0)
@@ -173,12 +173,12 @@ static void pnv_alloc_idle_core_states(void)
 		for (j = 0; j < threads_per_core; j++) {
 			int cpu = first_cpu + j;
 
-			paca[cpu].core_idle_state_ptr = core_idle_state;
-			paca[cpu].thread_idle_state = PNV_THREAD_RUNNING;
-			paca[cpu].thread_mask = 1 << j;
+			paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
+			paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
+			paca_ptrs[cpu]->thread_mask = 1 << j;
 			if (!cpu_has_feature(CPU_FTR_POWER9_DD1))
 				continue;
-			paca[cpu].thread_sibling_pacas =
+			paca_ptrs[cpu]->thread_sibling_pacas =
 				kmalloc_node(paca_ptr_array_size,
 					     GFP_KERNEL, node);
 		}
@@ -749,7 +749,8 @@ static int __init pnv_init_idle_states(void)
 			for (i = 0; i < threads_per_core; i++) {
 				int j = base_cpu + i;
 
-				paca[j].thread_sibling_pacas[idx] = &paca[cpu];
+				paca_ptrs[j]->thread_sibling_pacas[idx] =
+					paca_ptrs[cpu];
 			}
 		}
 	}
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 4fb21e17504a..b62ca0220ea5 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -254,7 +254,7 @@ static void pnv_kexec_wait_secondaries_down(void)
 			if (i != notified) {
 				printk(KERN_INFO "kexec: waiting for cpu %d "
 				       "(physical %d) to enter OPAL\n",
-				       i, paca[i].hw_cpu_id);
+				       i, paca_ptrs[i]->hw_cpu_id);
 				notified = i;
 			}
 
@@ -266,7 +266,7 @@ static void pnv_kexec_wait_secondaries_down(void)
 			if (timeout-- == 0) {
 				printk(KERN_ERR "kexec: timed out waiting for "
 				       "cpu %d (physical %d) to enter OPAL\n",
-				       i, paca[i].hw_cpu_id);
+				       i, paca_ptrs[i]->hw_cpu_id);
 				break;
 			}
 		}
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 9664c8461f03..19af6de6b6f0 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -80,7 +80,7 @@ static int pnv_smp_kick_cpu(int nr)
 	 * If we already started or OPAL is not supported, we just
 	 * kick the CPU via the PACA
 	 */
-	if (paca[nr].cpu_start || !firmware_has_feature(FW_FEATURE_OPAL))
+	if (paca_ptrs[nr]->cpu_start || !firmware_has_feature(FW_FEATURE_OPAL))
 		goto kick;
 
 	/*
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index 596ae2e98040..45563004feda 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -280,7 +280,7 @@ void update_subcore_sibling_mask(void)
 		int offset = (tid / threads_per_subcore) * threads_per_subcore;
 		int mask = sibling_mask_first_cpu << offset;
 
-		paca[cpu].subcore_sibling_mask = mask;
+		paca_ptrs[cpu]->subcore_sibling_mask = mask;
 
 	}
 }
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index f78fd2068d56..57365b0e51fb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -234,7 +234,7 @@ static void pseries_cpu_die(unsigned int cpu)
 	 * done here.  Change isolate state to Isolate and
 	 * change allocation-state to Unusable.
 	 */
-	paca[cpu].cpu_start = 0;
+	paca_ptrs[cpu]->cpu_start = 0;
 }
 
 /*
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 0ee4a469a4ae..b6d2ecce33eb 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -99,7 +99,7 @@ void vpa_init(int cpu)
 	 * reports that.  All SPLPAR support SLB shadow buffer.
 	 */
 	if (!radix_enabled() && firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		addr = __pa(paca[cpu].slb_shadow_ptr);
+		addr = __pa(paca_ptrs[cpu]->slb_shadow_ptr);
 		ret = register_slb_shadow(hwcpu, addr);
 		if (ret)
 			pr_err("WARNING: SLB shadow buffer registration for "
@@ -111,7 +111,7 @@ void vpa_init(int cpu)
 	/*
 	 * Register dispatch trace log, if one has been allocated.
 	 */
-	pp = &paca[cpu];
+	pp = paca_ptrs[cpu];
 	dtl = pp->dispatch_log;
 	if (dtl) {
 		pp->dtl_ridx = 0;
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 372d7ada1a0c..a66005a25c55 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -246,7 +246,7 @@ static int alloc_dispatch_logs(void)
 		return 0;
 
 	for_each_possible_cpu(cpu) {
-		pp = &paca[cpu];
+		pp = paca_ptrs[cpu];
 		dtl = kmem_cache_alloc(dtl_cache, GFP_KERNEL);
 		if (!dtl) {
 			pr_warn("Failed to allocate dispatch trace log for cpu %d\n",
diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c
index 2e184829e5d4..d506bf661f0f 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -110,7 +110,7 @@ static inline int smp_startup_cpu(unsigned int lcpu)
 	}
 
 	/* Fixup atomic count: it exited inside IRQ handler. */
-	task_thread_info(paca[lcpu].__current)->preempt_count	= 0;
+	task_thread_info(paca_ptrs[lcpu]->__current)->preempt_count	= 0;
 #ifdef CONFIG_HOTPLUG_CPU
 	if (get_cpu_current_state(lcpu) == CPU_STATE_INACTIVE)
 		goto out;
@@ -165,7 +165,7 @@ static int smp_pSeries_kick_cpu(int nr)
 	 * cpu_start field to become non-zero After we set cpu_start,
 	 * the processor will continue on to secondary_start
 	 */
-	paca[nr].cpu_start = 1;
+	paca_ptrs[nr]->cpu_start = 1;
 #ifdef CONFIG_HOTPLUG_CPU
 	set_preferred_offline_state(nr, CPU_STATE_ONLINE);
 
diff --git a/arch/powerpc/sysdev/xics/icp-native.c b/arch/powerpc/sysdev/xics/icp-native.c
index 1459f4e8b698..37bfbc54aacb 100644
--- a/arch/powerpc/sysdev/xics/icp-native.c
+++ b/arch/powerpc/sysdev/xics/icp-native.c
@@ -164,7 +164,7 @@ void icp_native_cause_ipi_rm(int cpu)
 	 * Just like the cause_ipi functions, it is required to
 	 * include a full barrier before causing the IPI.
 	 */
-	xics_phys = paca[cpu].kvm_hstate.xics_phys;
+	xics_phys = paca_ptrs[cpu]->kvm_hstate.xics_phys;
 	mb();
 	__raw_rm_writeb(IPI_PRIORITY, xics_phys + XICS_MFRR);
 }
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 82e1a3ee6e0f..b6574b6f7d4a 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2327,7 +2327,7 @@ static void dump_one_paca(int cpu)
 	catch_memory_errors = 1;
 	sync();
 
-	p = &paca[cpu];
+	p = paca_ptrs[cpu];
 
 	printf("paca for cpu 0x%x @ %px:\n", cpu, p);
 
-- 
2.16.1


* [PATCH 03/14] powerpc/64s: allocate lppacas individually
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 02/14] powerpc/64: Use array of paca pointers and allocate pacas individually Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-13 12:41   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 04/14] powerpc/64s: allocate slb_shadow structures individually Nicholas Piggin
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

Allocate LPPACAs individually.

We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure and enforces the PAPR alignment
requirements at allocation time. We cannot reduce the 1kB allocation
size, however, because pre-v4.14 KVM hypervisors reject a VPA whose
advertised size is smaller than 1kB.
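
As a quick check of the alignment reasoning (constants as in the paca.c
hunk below): a 1kB block with 1kB alignment occupies exactly one of the
four 1kB slots within a 4kB page, so it can never cross a 4kB boundary:

        size_t size = 0x400;    /* 1kB, covers the 640-byte lppaca */

        /* 1kB-aligned: (pa & 0xfff) is 0x0, 0x400, 0x800 or 0xc00, so
         * pa + 0x400 never crosses a 4kB boundary */
        lp = __va(memblock_alloc_base(size, 0x400, limit));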

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/lppaca.h      | 24 ++++-----
 arch/powerpc/kernel/machine_kexec_64.c | 15 ++++--
 arch/powerpc/kernel/paca.c             | 89 ++++++++++++----------------------
 arch/powerpc/kvm/book3s_hv.c           |  3 +-
 arch/powerpc/mm/numa.c                 |  4 +-
 arch/powerpc/platforms/pseries/kexec.c |  7 ++-
 6 files changed, 63 insertions(+), 79 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 6e4589eee2da..65d589689f01 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -36,14 +36,16 @@
 #include <asm/mmu.h>
 
 /*
- * We only have to have statically allocated lppaca structs on
- * legacy iSeries, which supports at most 64 cpus.
- */
-#define NR_LPPACAS	1
-
-/*
- * The Hypervisor barfs if the lppaca crosses a page boundary.  A 1k
- * alignment is sufficient to prevent this
+ * The lppaca is the "virtual processor area" registered with the hypervisor,
+ * H_REGISTER_VPA etc.
+ *
+ * According to PAPR, the structure is 640 bytes long, must be L1 cache line
+ * aligned, and must not cross a 4kB boundary. Its size field must be at
+ * least 640 bytes (but may be more).
+ *
+ * Pre-v4.14 KVM hypervisors reject the VPA if its size field is smaller than
+ * 1kB, so we dynamically allocate 1kB and advertise size as 1kB, but keep
+ * this structure as the canonical 640 byte size.
  */
 struct lppaca {
 	/* cacheline 1 contains read-only data */
@@ -97,11 +99,9 @@ struct lppaca {
 
 	__be32	page_ins;		/* CMO Hint - # page ins by OS */
 	u8	reserved11[148];
-	volatile __be64 dtl_idx;		/* Dispatch Trace Log head index */
+	volatile __be64 dtl_idx;	/* Dispatch Trace Log head index */
 	u8	reserved12[96];
-} __attribute__((__aligned__(0x400)));
-
-extern struct lppaca lppaca[];
+} ____cacheline_aligned;
 
 #define lppaca_of(cpu)	(*paca_ptrs[cpu]->lppaca_ptr)
 
diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index a250e3331f94..1044bf15d5ed 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -323,17 +323,24 @@ void default_machine_kexec(struct kimage *image)
 	kexec_stack.thread_info.cpu = current_thread_info()->cpu;
 
 	/* We need a static PACA, too; copy this CPU's PACA over and switch to
-	 * it.  Also poison per_cpu_offset to catch anyone using non-static
-	 * data.
+	 * it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
+	 * non-static data.
 	 */
 	memcpy(&kexec_paca, get_paca(), sizeof(struct paca_struct));
 	kexec_paca.data_offset = 0xedeaddeadeeeeeeeUL;
+#ifdef CONFIG_PPC_PSERIES
+	kexec_paca.lppaca_ptr = NULL;
+#endif
 	paca_ptrs[kexec_paca.paca_index] = &kexec_paca;
+
 	setup_paca(&kexec_paca);
 
-	/* XXX: If anyone does 'dynamic lppacas' this will also need to be
-	 * switched to a static version!
+	/*
+	 * The lppaca should be unregistered at this point so the HV won't
+	 * touch it. In the case of a crash, none of the lppacas are
+	 * unregistered so there is not much we can do about it here.
 	 */
+
 	/*
 	 * On Book3S, the copy must happen with the MMU off if we are either
 	 * using Radix page tables or we are not in an LPAR since we can
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index eef4891c9af6..6cddb9bdc151 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -23,82 +23,50 @@
 #ifdef CONFIG_PPC_PSERIES
 
 /*
- * The structure which the hypervisor knows about - this structure
- * should not cross a page boundary.  The vpa_init/register_vpa call
- * is now known to fail if the lppaca structure crosses a page
- * boundary.  The lppaca is also used on POWER5 pSeries boxes.
- * The lppaca is 640 bytes long, and cannot readily
- * change since the hypervisor knows its layout, so a 1kB alignment
- * will suffice to ensure that it doesn't cross a page boundary.
+ * See asm/lppaca.h for more detail.
+ *
+ * lppaca structures must be 1kB in size, L1 cache line aligned,
+ * and not cross 4kB boundary. A 1kB size and 1kB alignment will satisfy
+ * these requirements.
  */
-struct lppaca lppaca[] = {
-	[0 ... (NR_LPPACAS-1)] = {
+static inline void init_lppaca(struct lppaca *lppaca)
+{
+	BUILD_BUG_ON(sizeof(struct lppaca) != 640);
+
+	*lppaca = (struct lppaca) {
 		.desc = cpu_to_be32(0xd397d781),	/* "LpPa" */
-		.size = cpu_to_be16(sizeof(struct lppaca)),
+		.size = cpu_to_be16(0x400),
 		.fpregs_in_use = 1,
 		.slb_count = cpu_to_be16(64),
 		.vmxregs_in_use = 0,
-		.page_ins = 0,
-	},
+		.page_ins = 0, };
 };
 
-static struct lppaca *extra_lppacas;
-static long __initdata lppaca_size;
-
-static void __init allocate_lppacas(int nr_cpus, unsigned long limit)
-{
-	if (early_cpu_has_feature(CPU_FTR_HVMODE))
-		return;
-
-	if (nr_cpus <= NR_LPPACAS)
-		return;
-
-	lppaca_size = PAGE_ALIGN(sizeof(struct lppaca) *
-				 (nr_cpus - NR_LPPACAS));
-	extra_lppacas = __va(memblock_alloc_base(lppaca_size,
-						 PAGE_SIZE, limit));
-}
-
-static struct lppaca * __init new_lppaca(int cpu)
+static struct lppaca * __init new_lppaca(int cpu, unsigned long limit)
 {
 	struct lppaca *lp;
+	size_t size = 0x400;
+
+	BUILD_BUG_ON(size < sizeof(struct lppaca));
 
 	if (early_cpu_has_feature(CPU_FTR_HVMODE))
 		return NULL;
 
-	if (cpu < NR_LPPACAS)
-		return &lppaca[cpu];
-
-	lp = extra_lppacas + (cpu - NR_LPPACAS);
-	*lp = lppaca[0];
+	lp = __va(memblock_alloc_base(size, 0x400, limit));
+	init_lppaca(lp);
 
 	return lp;
 }
 
-static void __init free_lppacas(void)
+static void __init free_lppaca(struct lppaca *lp)
 {
-	long new_size = 0, nr;
+	size_t size = 0x400;
 
 	if (early_cpu_has_feature(CPU_FTR_HVMODE))
 		return;
 
-	if (!lppaca_size)
-		return;
-	nr = num_possible_cpus() - NR_LPPACAS;
-	if (nr > 0)
-		new_size = PAGE_ALIGN(nr * sizeof(struct lppaca));
-	if (new_size >= lppaca_size)
-		return;
-
-	memblock_free(__pa(extra_lppacas) + new_size, lppaca_size - new_size);
-	lppaca_size = new_size;
+	memblock_free(__pa(lp), size);
 }
-
-#else
-
-static inline void allocate_lppacas(int nr_cpus, unsigned long limit) { }
-static inline void free_lppacas(void) { }
-
 #endif /* CONFIG_PPC_BOOK3S */
 
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -167,7 +135,7 @@ EXPORT_SYMBOL(paca_ptrs);
 void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 {
 #ifdef CONFIG_PPC_PSERIES
-	new_paca->lppaca_ptr = new_lppaca(cpu);
+	new_paca->lppaca_ptr = NULL;
 #endif
 #ifdef CONFIG_PPC_BOOK3E
 	new_paca->kernel_pgd = swapper_pg_dir;
@@ -254,13 +222,15 @@ void __init allocate_pacas(void)
 	printk(KERN_DEBUG "Allocated %lu bytes for %u pacas\n",
 			size, nr_cpu_ids);
 
-	allocate_lppacas(nr_cpu_ids, limit);
-
 	allocate_slb_shadows(nr_cpu_ids, limit);
 
 	/* Can't use for_each_*_cpu, as they aren't functional yet */
-	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
 		initialise_paca(paca_ptrs[cpu], cpu);
+#ifdef CONFIG_PPC_PSERIES
+		paca_ptrs[cpu]->lppaca_ptr = new_lppaca(cpu, limit);
+#endif
+	}
 }
 
 void __init free_unused_pacas(void)
@@ -272,6 +242,9 @@ void __init free_unused_pacas(void)
 	for (cpu = 0; cpu < paca_nr_cpu_ids; cpu++) {
 		if (!cpu_possible(cpu)) {
 			unsigned long pa = __pa(paca_ptrs[cpu]);
+#ifdef CONFIG_PPC_PSERIES
+			free_lppaca(paca_ptrs[cpu]->lppaca_ptr);
+#endif
 			memblock_free(pa, sizeof(struct paca_struct));
 			paca_ptrs[cpu] = NULL;
 			size += sizeof(struct paca_struct);
@@ -288,8 +261,6 @@ void __init free_unused_pacas(void)
 	if (size)
 		printk(KERN_DEBUG "Freed %lu bytes for unused pacas\n", size);
 
-	free_lppacas();
-
 	paca_nr_cpu_ids = nr_cpu_ids;
 	paca_ptrs_size = new_ptrs_size;
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d340bda12067..61928510ed1b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -495,7 +495,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
 		 * use 640 bytes of the structure though, so we should accept
 		 * clients that set a size of 640.
 		 */
-		if (len < 640)
+		BUILD_BUG_ON(sizeof(struct lppaca) != 640);
+		if (len < sizeof(struct lppaca))
 			break;
 		vpap = &tvcpu->arch.vpa;
 		err = 0;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index edd8d0bc9364..9c3eb62bced5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1105,7 +1105,7 @@ static void setup_cpu_associativity_change_counters(void)
 	for_each_possible_cpu(cpu) {
 		int i;
 		u8 *counts = vphn_cpu_change_counts[cpu];
-		volatile u8 *hypervisor_counts = lppaca[cpu].vphn_assoc_counts;
+		volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts;
 
 		for (i = 0; i < distance_ref_points_depth; i++)
 			counts[i] = hypervisor_counts[i];
@@ -1131,7 +1131,7 @@ static int update_cpu_associativity_changes_mask(void)
 	for_each_possible_cpu(cpu) {
 		int i, changed = 0;
 		u8 *counts = vphn_cpu_change_counts[cpu];
-		volatile u8 *hypervisor_counts = lppaca[cpu].vphn_assoc_counts;
+		volatile u8 *hypervisor_counts = lppaca_of(cpu).vphn_assoc_counts;
 
 		for (i = 0; i < distance_ref_points_depth; i++) {
 			if (hypervisor_counts[i] != counts[i]) {
diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
index eeb13429d685..3fe126796975 100644
--- a/arch/powerpc/platforms/pseries/kexec.c
+++ b/arch/powerpc/platforms/pseries/kexec.c
@@ -23,7 +23,12 @@
 
 void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
 {
-	/* Don't risk a hypervisor call if we're crashing */
+	/*
+	 * Don't risk a hypervisor call if we're crashing
+	 * XXX: Why? The hypervisor is not crashing. It might be better
+	 * to at least attempt unregister to avoid the hypervisor stepping
+	 * on our memory.
+	 */
 	if (firmware_has_feature(FW_FEATURE_SPLPAR) && !crash_shutdown) {
 		int ret;
 		int cpu = smp_processor_id();
-- 
2.16.1


* [PATCH 04/14] powerpc/64s: allocate slb_shadow structures individually
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (2 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 03/14] powerpc/64s: allocate lppacas individually Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 05/14] mm: make memblock_alloc_base_nid non-static Nicholas Piggin
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

Allocate slb_shadow structures individually.

slb_shadow structures are not needed when running with the radix MMU,
so they are not allocated in that case.
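
One wrinkle, visible in the diff below: the boot CPU's paca is set up
before the MMU mode (hash vs radix) is known, so its slb_shadow is
always allocated, and is freed again in free_unused_pacas() if radix
turns out to be enabled. A minimal sketch of that flow:

        /* boot cpu: early_radix_enabled() not yet parsed, allocate anyway */
        s = new_slb_shadow(boot_cpuid, limit);

        /* later, once the MMU mode is known */
        if (early_radix_enabled()) {
                memblock_free(__pa(s), sizeof(struct slb_shadow));
                s = NULL;
        }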

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/paca.c | 65 +++++++++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 6cddb9bdc151..2699f9009286 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -72,41 +72,28 @@ static void __init free_lppaca(struct lppaca *lp)
 #ifdef CONFIG_PPC_BOOK3S_64
 
 /*
- * 3 persistent SLBs are registered here.  The buffer will be zero
+ * 3 persistent SLBs are allocated here.  The buffer will be zero
  * initially, hence will all be invalid until we actually write them.
  *
  * If you make the number of persistent SLB entries dynamic, please also
  * update PR KVM to flush and restore them accordingly.
  */
-static struct slb_shadow * __initdata slb_shadow;
-
-static void __init allocate_slb_shadows(int nr_cpus, int limit)
-{
-	int size = PAGE_ALIGN(sizeof(struct slb_shadow) * nr_cpus);
-
-	if (early_radix_enabled())
-		return;
-
-	slb_shadow = __va(memblock_alloc_base(size, PAGE_SIZE, limit));
-	memset(slb_shadow, 0, size);
-}
-
-static struct slb_shadow * __init init_slb_shadow(int cpu)
+static struct slb_shadow * __init new_slb_shadow(int cpu, unsigned long limit)
 {
 	struct slb_shadow *s;
 
-	if (early_radix_enabled())
-		return NULL;
+	if (cpu != boot_cpuid) {
+		/*
+		 * Boot CPU comes here before early_radix_enabled
+		 * is parsed (e.g., for disable_radix). So always
+		 * allocate, and fix this up in free_unused_pacas().
+		 */
+		if (early_radix_enabled())
+			return NULL;
+	}
 
-	s = &slb_shadow[cpu];
-
-	/*
-	 * When we come through here to initialise boot_paca, the slb_shadow
-	 * buffers are not allocated yet. That's OK, we'll get one later in
-	 * boot, but make sure we don't corrupt memory at 0.
-	 */
-	if (!slb_shadow)
-		return NULL;
+	s = __va(memblock_alloc_base(sizeof(*s), L1_CACHE_BYTES, limit));
+	memset(s, 0, sizeof(*s));
 
 	s->persistent = cpu_to_be32(SLB_NUM_BOLTED);
 	s->buffer_length = cpu_to_be32(sizeof(*s));
@@ -114,10 +101,6 @@ static struct slb_shadow * __init init_slb_shadow(int cpu)
 	return s;
 }
 
-#else /* !CONFIG_PPC_BOOK3S_64 */
-
-static void __init allocate_slb_shadows(int nr_cpus, int limit) { }
-
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 /* The Paca is an array with one entry per processor.  Each contains an
@@ -151,7 +134,7 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 	new_paca->__current = &init_task;
 	new_paca->data_offset = 0xfeeeeeeeeeeeeeeeULL;
 #ifdef CONFIG_PPC_BOOK3S_64
-	new_paca->slb_shadow_ptr = init_slb_shadow(cpu);
+	new_paca->slb_shadow_ptr = NULL;
 #endif
 
 #ifdef CONFIG_PPC_BOOK3E
@@ -222,13 +205,16 @@ void __init allocate_pacas(void)
 	printk(KERN_DEBUG "Allocated %lu bytes for %u pacas\n",
 			size, nr_cpu_ids);
 
-	allocate_slb_shadows(nr_cpu_ids, limit);
-
 	/* Can't use for_each_*_cpu, as they aren't functional yet */
 	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
-		initialise_paca(paca_ptrs[cpu], cpu);
+		struct paca_struct *paca = paca_ptrs[cpu];
+
+		initialise_paca(paca, cpu);
 #ifdef CONFIG_PPC_PSERIES
-		paca_ptrs[cpu]->lppaca_ptr = new_lppaca(cpu, limit);
+		paca->lppaca_ptr = new_lppaca(cpu, limit);
+#endif
+#ifdef CONFIG_PPC_BOOK3S_64
+		paca->slb_shadow_ptr = new_slb_shadow(cpu, limit);
 #endif
 	}
 }
@@ -263,6 +249,15 @@ void __init free_unused_pacas(void)
 
 	paca_nr_cpu_ids = nr_cpu_ids;
 	paca_ptrs_size = new_ptrs_size;
+
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (early_radix_enabled()) {
+		/* Ugly fixup, see new_slb_shadow() */
+		memblock_free(__pa(paca_ptrs[boot_cpuid]->slb_shadow_ptr),
+				sizeof(struct slb_shadow));
+		paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
+	}
+#endif
 }
 
 void copy_mm_to_paca(struct mm_struct *mm)
-- 
2.16.1


* [PATCH 05/14] mm: make memblock_alloc_base_nid non-static
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (3 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 04/14] powerpc/64s: allocate slb_shadow structures individually Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-13 12:06     ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 06/14] powerpc/mm/numa: move numa topology discovery earlier Nicholas Piggin
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

This will be used by powerpc to allocate per-cpu stacks and other
data structures node-local where possible.
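
A minimal sketch of the intended call pattern (fallback included), as
the later patches in this series use it:

	pa = memblock_alloc_base_nid(size, align, limit, nid, MEMBLOCK_NONE);
	if (!pa)	/* no suitable memory on that node */
		pa = memblock_alloc_base(size, align, limit);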

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 include/linux/memblock.h | 5 ++++-
 mm/memblock.c            | 2 +-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8be5077efb5f..8cab51398705 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -316,9 +316,12 @@ static inline bool memblock_bottom_up(void)
 #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
 #define MEMBLOCK_ALLOC_ACCESSIBLE	0
 
-phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
+phys_addr_t memblock_alloc_range(phys_addr_t size, phys_addr_t align,
 					phys_addr_t start, phys_addr_t end,
 					ulong flags);
+phys_addr_t memblock_alloc_base_nid(phys_addr_t size,
+					phys_addr_t align, phys_addr_t max_addr,
+					int nid, ulong flags);
 phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
 				phys_addr_t max_addr);
 phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
diff --git a/mm/memblock.c b/mm/memblock.c
index 5a9ca2a1751b..cea2af494da0 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1190,7 +1190,7 @@ phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
 					flags);
 }
 
-static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
+phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
 					phys_addr_t align, phys_addr_t max_addr,
 					int nid, ulong flags)
 {
-- 
2.16.1


* [PATCH 06/14] powerpc/mm/numa: move numa topology discovery earlier
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (4 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 05/14] mm: make memblock_alloc_base_nid non-static Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 07/14] powerpc/64: move default SPR recording Nicholas Piggin
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Split sparsemem initialisation from basic numa topology discovery.
Move the parsing earlier in boot, before pacas are allocated.
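
The resulting boot order, roughly (names as in this patch):

	setup_arch()
	  -> mem_topology_setup()	/* parse NUMA, cpu<->node setup */
	  ...				/* pacas etc. can use node info */
	  -> initmem_init()		/* sparsemem present + init only */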

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/setup.h   |  1 +
 arch/powerpc/kernel/setup-common.c |  3 +++
 arch/powerpc/mm/mem.c              |  5 ++++-
 arch/powerpc/mm/numa.c             | 32 +++++++++++++++++++-------------
 4 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 469b7fdc9be4..d2bf233aebd5 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -23,6 +23,7 @@ extern void reloc_got2(unsigned long);
 #define PTRRELOC(x)	((typeof(x)) add_reloc_offset((unsigned long)(x)))
 
 void check_for_initrd(void);
+void mem_topology_setup(void);
 void initmem_init(void);
 void setup_panic(void);
 #define ARCH_PANIC_TIMEOUT 180
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index d73ec518ef80..9eaf26318d20 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -888,6 +888,9 @@ void __init setup_arch(char **cmdline_p)
 	/* Check the SMT related command line arguments (ppc64). */
 	check_smt_enabled();
 
+	/* Parse memory topology */
+	mem_topology_setup();
+
 	/* On BookE, setup per-core TLB data structures. */
 	setup_tlb_core_data();
 
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index fe8c61149fb8..4eee46ea4d96 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -212,7 +212,7 @@ walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
 EXPORT_SYMBOL_GPL(walk_system_ram_range);
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
-void __init initmem_init(void)
+void __init mem_topology_setup(void)
 {
 	max_low_pfn = max_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
 	min_low_pfn = MEMORY_START >> PAGE_SHIFT;
@@ -224,7 +224,10 @@ void __init initmem_init(void)
 	 * memblock_regions
 	 */
 	memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
+}
 
+void __init initmem_init(void)
+{
 	/* XXX need to clip this if using highmem? */
 	sparse_memory_present_with_active_regions(0);
 	sparse_init();
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9c3eb62bced5..57a5029b4521 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -831,18 +831,13 @@ static void __init find_possible_nodes(void)
 	of_node_put(rtas);
 }
 
-void __init initmem_init(void)
+void __init mem_topology_setup(void)
 {
-	int nid, cpu;
-
-	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
-	max_pfn = max_low_pfn;
+	int cpu;
 
 	if (parse_numa_properties())
 		setup_nonnuma();
 
-	memblock_dump_all();
-
 	/*
 	 * Modify the set of possible NUMA nodes to reflect information
 	 * available about the set of online nodes, and the set of nodes
@@ -853,6 +848,23 @@ void __init initmem_init(void)
 
 	find_possible_nodes();
 
+	setup_node_to_cpumask_map();
+
+	reset_numa_cpu_lookup_table();
+
+	for_each_present_cpu(cpu)
+		numa_setup_cpu(cpu);
+}
+
+void __init initmem_init(void)
+{
+	int nid;
+
+	max_low_pfn = memblock_end_of_DRAM() >> PAGE_SHIFT;
+	max_pfn = max_low_pfn;
+
+	memblock_dump_all();
+
 	for_each_online_node(nid) {
 		unsigned long start_pfn, end_pfn;
 
@@ -863,10 +875,6 @@ void __init initmem_init(void)
 
 	sparse_init();
 
-	setup_node_to_cpumask_map();
-
-	reset_numa_cpu_lookup_table();
-
 	/*
 	 * We need the numa_cpu_lookup_table to be accurate for all CPUs,
 	 * even before we online them, so that we can use cpu_to_{node,mem}
@@ -876,8 +884,6 @@ void __init initmem_init(void)
 	 */
 	cpuhp_setup_state_nocalls(CPUHP_POWER_NUMA_PREPARE, "powerpc/numa:prepare",
 				  ppc_numa_cpu_prepare, ppc_numa_cpu_dead);
-	for_each_present_cpu(cpu)
-		numa_setup_cpu(cpu);
 }
 
 static int __init early_numa(char *p)
-- 
2.16.1


* [PATCH 07/14] powerpc/64: move default SPR recording
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (5 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 06/14] powerpc/mm/numa: move numa topology discovery earlier Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-13 12:25   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 08/14] powerpc/setup: cpu_to_phys_id array Nicholas Piggin
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Move this into the early setup code, and don't iterate over CPU masks.
We don't want to call into sysfs that early in boot, and a future patch
will mean the CPU masks are not yet initialized by the time this is
called.
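
The flow after this patch, as a rough sketch condensed from the diff:

	/* record_spr_defaults(), early in boot: capture the value once */
	if (early_cpu_has_feature(CPU_FTR_DSCR))
		spr_default_dscr = mfspr(SPRN_DSCR);

	/* initialise_paca(): each paca picks it up, no cpu mask walk */
	new_paca->dscr_default = spr_default_dscr;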
---
 arch/powerpc/kernel/paca.c     |  3 +++
 arch/powerpc/kernel/setup.h    |  9 +++------
 arch/powerpc/kernel/setup_64.c |  8 ++++++++
 arch/powerpc/kernel/sysfs.c    | 18 +++++++-----------
 4 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 2699f9009286..e560072f122b 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -133,6 +133,9 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 	new_paca->kexec_state = KEXEC_STATE_NONE;
 	new_paca->__current = &init_task;
 	new_paca->data_offset = 0xfeeeeeeeeeeeeeeeULL;
+#ifdef CONFIG_PPC64
+	new_paca->dscr_default = spr_default_dscr;
+#endif
 #ifdef CONFIG_PPC_BOOK3S_64
 	new_paca->slb_shadow_ptr = NULL;
 #endif
diff --git a/arch/powerpc/kernel/setup.h b/arch/powerpc/kernel/setup.h
index 3fc11e30308f..d144df54ad40 100644
--- a/arch/powerpc/kernel/setup.h
+++ b/arch/powerpc/kernel/setup.h
@@ -45,14 +45,11 @@ void emergency_stack_init(void);
 static inline void emergency_stack_init(void) { };
 #endif
 
-#ifdef CONFIG_PPC64
-void record_spr_defaults(void);
-#else
-static inline void record_spr_defaults(void) { };
-#endif
-
 #ifdef CONFIG_PPC64
 u64 ppc64_bolted_size(void);
+
+/* Default SPR values from firmware/kexec */
+extern unsigned long spr_default_dscr;
 #endif
 
 /*
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 3ce12af4906f..dde34d35d1e7 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -254,6 +254,14 @@ static void cpu_ready_for_interrupts(void)
 	get_paca()->kernel_msr = MSR_KERNEL;
 }
 
+unsigned long spr_default_dscr = 0;
+
+void __init record_spr_defaults(void)
+{
+	if (early_cpu_has_feature(CPU_FTR_DSCR))
+		spr_default_dscr = mfspr(SPRN_DSCR);
+}
+
 /*
  * Early initialization entry point. This is called by head.S
  * with MMU translation disabled. We rely on the "feature" of
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 1f9d94dac3a6..ab4eb61fe659 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -588,21 +588,17 @@ static DEVICE_ATTR(dscr_default, 0600,
 
 static void sysfs_create_dscr_default(void)
 {
-	int err = 0;
-	if (cpu_has_feature(CPU_FTR_DSCR))
-		err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
-}
-
-void __init record_spr_defaults(void)
-{
-	int cpu;
-
 	if (cpu_has_feature(CPU_FTR_DSCR)) {
-		dscr_default = mfspr(SPRN_DSCR);
-		for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+		int err = 0;
+		int cpu;
+
+		for_each_possible_cpu(cpu)
 			paca_ptrs[cpu]->dscr_default = dscr_default;
+
+		err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
 	}
 }
+
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
-- 
2.16.1


* [PATCH 08/14] powerpc/setup: cpu_to_phys_id array
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (6 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 07/14] powerpc/64: move default SPR recording Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-29  5:51   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered Nicholas Piggin
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Build an array that maps logical CPU number to hardware CPU number,
filled in during firmware CPU discovery. Use that rather than setting
the paca of other CPUs directly, to begin with. A subsequent patch will
defer paca allocation, so pacas will not yet be allocated at this
point.
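
A sketch of the new flow, condensed from the diff:

	/* smp_setup_cpu_maps(): firmware scan fills the array */
	cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);

	/* arch_match_cpu_phys_id(): usable before pacas exist */
	if (cpu_to_phys_id != NULL)
		return (int)phys_id == cpu_to_phys_id[cpu];
---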
---
 arch/powerpc/include/asm/smp.h     |  1 +
 arch/powerpc/kernel/prom.c         |  7 +++++++
 arch/powerpc/kernel/setup-common.c | 15 ++++++++++++++-
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index ec7b299350d9..cfecfee1194b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -31,6 +31,7 @@
 
 extern int boot_cpuid;
 extern int spinning_secondaries;
+extern u32 *cpu_to_phys_id;
 
 extern void cpu_die(void);
 extern int cpu_to_chip_id(int cpu);
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 4dffef947b8a..5979e34ba90e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -874,5 +874,12 @@ EXPORT_SYMBOL(cpu_to_chip_id);
 
 bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
 {
+	/*
+	 * Early firmware scanning must use this rather than
+	 * get_hard_smp_processor_id because we don't have pacas allocated
+	 * until memory topology is discovered.
+	 */
+	if (cpu_to_phys_id != NULL)
+		return (int)phys_id == cpu_to_phys_id[cpu];
 	return (int)phys_id == get_hard_smp_processor_id(cpu);
 }
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 9eaf26318d20..bd79a5644c78 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -437,6 +437,8 @@ static void __init cpu_init_thread_core_maps(int tpc)
 }
 
 
+u32 *cpu_to_phys_id = NULL;
+
 /**
  * setup_cpu_maps - initialize the following cpu maps:
  *                  cpu_possible_mask
@@ -463,6 +465,10 @@ void __init smp_setup_cpu_maps(void)
 
 	DBG("smp_setup_cpu_maps()\n");
 
+	cpu_to_phys_id = __va(memblock_alloc(nr_cpu_ids * sizeof(u32),
+							__alignof__(u32)));
+	memset(cpu_to_phys_id, 0, nr_cpu_ids * sizeof(u32));
+
 	for_each_node_by_type(dn, "cpu") {
 		const __be32 *intserv;
 		__be32 cpu_be;
@@ -480,6 +486,7 @@ void __init smp_setup_cpu_maps(void)
 			intserv = of_get_property(dn, "reg", &len);
 			if (!intserv) {
 				cpu_be = cpu_to_be32(cpu);
+				/* XXX: what is this? uninitialized?? */
 				intserv = &cpu_be;	/* assume logical == phys */
 				len = 4;
 			}
@@ -499,8 +506,8 @@ void __init smp_setup_cpu_maps(void)
 						"enable-method", "spin-table");
 
 			set_cpu_present(cpu, avail);
-			set_hard_smp_processor_id(cpu, be32_to_cpu(intserv[j]));
 			set_cpu_possible(cpu, true);
+			cpu_to_phys_id[cpu] = be32_to_cpu(intserv[j]);
 			cpu++;
 		}
 
@@ -570,6 +577,12 @@ void __init smp_setup_cpu_maps(void)
 	setup_nr_cpu_ids();
 
 	free_unused_pacas();
+
+	for_each_possible_cpu(cpu) {
+		if (cpu == smp_processor_id())
+			continue;
+		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
+	}
 }
 #endif /* CONFIG_SMP */
 
-- 
2.16.1


* [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (7 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 08/14] powerpc/setup: cpu_to_phys_id array Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-29  5:51   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 10/14] powerpc/64: allocate pacas per node Nicholas Piggin
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin
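
Defer paca allocation until the memory topology is discovered, so that
a later patch can allocate each paca on its local node.
allocate_paca_ptrs() sets up only the pointer array early, and
allocate_paca(cpu) allocates each paca individually once the topology
is known. The call sequence, as a rough sketch condensed from the
diff:

	allocate_paca_ptrs();		/* early_init_devtree() */
	allocate_paca(boot_cpuid);	/* once the boot cpu is known */
	...
	for_each_possible_cpu(cpu)	/* setup_arch(); boot cpu skipped */
		allocate_paca(cpu);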

---
 arch/powerpc/include/asm/paca.h    |  3 +-
 arch/powerpc/kernel/paca.c         | 90 ++++++++++++--------------------------
 arch/powerpc/kernel/prom.c         |  5 ++-
 arch/powerpc/kernel/setup-common.c | 24 +++++++---
 4 files changed, 51 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index f266b0a7be95..407a8076edd7 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -252,7 +252,8 @@ extern void copy_mm_to_paca(struct mm_struct *mm);
 extern struct paca_struct **paca_ptrs;
 extern void initialise_paca(struct paca_struct *new_paca, int cpu);
 extern void setup_paca(struct paca_struct *new_paca);
-extern void allocate_pacas(void);
+extern void allocate_paca_ptrs(void);
+extern void allocate_paca(int cpu);
 extern void free_unused_pacas(void);
 
 #else /* CONFIG_PPC64 */
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index e560072f122b..12d329467631 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -57,16 +57,6 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned long limit)
 
 	return lp;
 }
-
-static void __init free_lppaca(struct lppaca *lp)
-{
-	size_t size = 0x400;
-
-	if (early_cpu_has_feature(CPU_FTR_HVMODE))
-		return;
-
-	memblock_free(__pa(lp), size);
-}
 #endif /* CONFIG_PPC_BOOK3S */
 
 #ifdef CONFIG_PPC_BOOK3S_64
@@ -169,12 +159,24 @@ void setup_paca(struct paca_struct *new_paca)
 
 static int __initdata paca_nr_cpu_ids;
 static int __initdata paca_ptrs_size;
+static int __initdata paca_struct_size;
+
+void __init allocate_paca_ptrs(void)
+{
+	paca_nr_cpu_ids = nr_cpu_ids;
+
+	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
+	paca_ptrs = __va(memblock_alloc(paca_ptrs_size, 0));
+	memset(paca_ptrs, 0x88, paca_ptrs_size);
+}
 
-void __init allocate_pacas(void)
+void __init allocate_paca(int cpu)
 {
 	u64 limit;
-	unsigned long size = 0;
-	int cpu;
+	unsigned long pa;
+	struct paca_struct *paca;
+
+	BUG_ON(cpu >= paca_nr_cpu_ids);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	/*
@@ -186,69 +188,30 @@ void __init allocate_pacas(void)
 	limit = ppc64_rma_size;
 #endif
 
-	paca_nr_cpu_ids = nr_cpu_ids;
-
-	paca_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
-	paca_ptrs = __va(memblock_alloc_base(paca_ptrs_size, 0, limit));
-	memset(paca_ptrs, 0, paca_ptrs_size);
-
-	size += paca_ptrs_size;
-
-	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
-		unsigned long pa;
-
-		pa = memblock_alloc_base(sizeof(struct paca_struct),
-						L1_CACHE_BYTES, limit);
-		paca_ptrs[cpu] = __va(pa);
-		memset(paca_ptrs[cpu], 0, sizeof(struct paca_struct));
-
-		size += sizeof(struct paca_struct);
-	}
-
-	printk(KERN_DEBUG "Allocated %lu bytes for %u pacas\n",
-			size, nr_cpu_ids);
-
-	/* Can't use for_each_*_cpu, as they aren't functional yet */
-	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
-		struct paca_struct *paca = paca_ptrs[cpu];
+	pa = memblock_alloc_base(sizeof(struct paca_struct),
+					L1_CACHE_BYTES, limit);
+	paca = __va(pa);
+	paca_ptrs[cpu] = paca;
+	memset(paca, 0, sizeof(struct paca_struct));
 
-		initialise_paca(paca, cpu);
+	initialise_paca(paca, cpu);
 #ifdef CONFIG_PPC_PSERIES
-		paca->lppaca_ptr = new_lppaca(cpu, limit);
+	paca->lppaca_ptr = new_lppaca(cpu, limit);
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
-		paca->slb_shadow_ptr = new_slb_shadow(cpu, limit);
+	paca->slb_shadow_ptr = new_slb_shadow(cpu, limit);
 #endif
-	}
+	paca_struct_size += sizeof(struct paca_struct);
 }
 
 void __init free_unused_pacas(void)
 {
-	unsigned long size = 0;
 	int new_ptrs_size;
-	int cpu;
-
-	for (cpu = 0; cpu < paca_nr_cpu_ids; cpu++) {
-		if (!cpu_possible(cpu)) {
-			unsigned long pa = __pa(paca_ptrs[cpu]);
-#ifdef CONFIG_PPC_PSERIES
-			free_lppaca(paca_ptrs[cpu]->lppaca_ptr);
-#endif
-			memblock_free(pa, sizeof(struct paca_struct));
-			paca_ptrs[cpu] = NULL;
-			size += sizeof(struct paca_struct);
-		}
-	}
 
 	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
-	if (new_ptrs_size < paca_ptrs_size) {
+	if (new_ptrs_size < paca_ptrs_size)
 		memblock_free(__pa(paca_ptrs) + new_ptrs_size,
 					paca_ptrs_size - new_ptrs_size);
-		size += paca_ptrs_size - new_ptrs_size;
-	}
-
-	if (size)
-		printk(KERN_DEBUG "Freed %lu bytes for unused pacas\n", size);
 
 	paca_nr_cpu_ids = nr_cpu_ids;
 	paca_ptrs_size = new_ptrs_size;
@@ -261,6 +224,9 @@ void __init free_unused_pacas(void)
 		paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
 	}
 #endif
+
+	printk(KERN_DEBUG "Allocated %u bytes for %u pacas\n",
+			paca_ptrs_size + paca_struct_size, nr_cpu_ids);
 }
 
 void copy_mm_to_paca(struct mm_struct *mm)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 5979e34ba90e..a8b6ddbb540f 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -365,7 +365,6 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 	DBG("boot cpu: logical %d physical %d\n", found,
 	    be32_to_cpu(intserv[found_thread]));
 	boot_cpuid = found;
-	set_hard_smp_processor_id(found, be32_to_cpu(intserv[found_thread]));
 
 	/*
 	 * PAPR defines "logical" PVR values for cpus that
@@ -403,7 +402,9 @@ static int __init early_init_dt_scan_cpus(unsigned long node,
 		cur_cpu_spec->cpu_features &= ~CPU_FTR_SMT;
 	else if (!dt_cpu_ftrs_in_use())
 		cur_cpu_spec->cpu_features |= CPU_FTR_SMT;
+	allocate_paca(boot_cpuid);
 #endif
+	set_hard_smp_processor_id(found, be32_to_cpu(intserv[found_thread]));
 
 	return 0;
 }
@@ -744,7 +745,7 @@ void __init early_init_devtree(void *params)
 	 * FIXME .. and the initrd too? */
 	move_device_tree();
 
-	allocate_pacas();
+	allocate_paca_ptrs();
 
 	DBG("Scanning CPUs ...\n");
 
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index bd79a5644c78..af7a47c8fe10 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -577,12 +577,6 @@ void __init smp_setup_cpu_maps(void)
 	setup_nr_cpu_ids();
 
 	free_unused_pacas();
-
-	for_each_possible_cpu(cpu) {
-		if (cpu == smp_processor_id())
-			continue;
-		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
-	}
 }
 #endif /* CONFIG_SMP */
 
@@ -848,6 +842,23 @@ static __init void print_system_info(void)
 	pr_info("-----------------------------------------------------\n");
 }
 
+#ifdef CONFIG_SMP
+static void smp_setup_pacas(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		if (cpu == smp_processor_id())
+			continue;
+		allocate_paca(cpu);
+		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
+	}
+
+	memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
+	cpu_to_phys_id = NULL;
+}
+#endif
+
 /*
  * Called into from start_kernel this initializes memblock, which is used
  * to manage page allocation until mem_init is called.
@@ -915,6 +926,7 @@ void __init setup_arch(char **cmdline_p)
 	 * so smp_release_cpus() does nothing for them.
 	 */
 #ifdef CONFIG_SMP
+	smp_setup_pacas();
 	smp_release_cpus();
 #endif
 
-- 
2.16.1


* [PATCH 10/14] powerpc/64: allocate pacas per node
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (8 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-29  5:50   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 11/14] powerpc/64: allocate per-cpu stacks node-local if possible Nicholas Piggin
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Per-node allocations are possible on 64s with radix, which does not
have the bolted SLB limitation.

Hash would be able to do the same if all CPUs had the bottom of
their node-local memory bolted as well. This is left as an
exercise for the reader.
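
The allocation helper falls back rather than failing outright; a
condensed sketch of alloc_paca_data() from the diff:

	pa = memblock_alloc_base_nid(size, align, limit,
				early_cpu_to_node(cpu), MEMBLOCK_NONE);
	if (!pa)	/* no node-local memory below limit: any node */
		pa = memblock_alloc_base(size, align, limit);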
---
 arch/powerpc/kernel/paca.c     | 41 +++++++++++++++++++++++++++++++++++------
 arch/powerpc/kernel/setup_64.c |  4 ++++
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 12d329467631..470ce21af8b5 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -20,6 +20,37 @@
 
 #include "setup.h"
 
+static void *__init alloc_paca_data(unsigned long size, unsigned long align,
+				unsigned long limit, int cpu)
+{
+	unsigned long pa;
+	int nid;
+
+	/*
+	 * boot_cpuid paca is allocated very early before cpu_to_node is up.
+	 * Set bottom-up mode, because the boot CPU should be on node-0,
+	 * which will put its paca in the right place.
+	 */
+	if (cpu == boot_cpuid) {
+		nid = -1;
+		memblock_set_bottom_up(true);
+	} else {
+		nid = early_cpu_to_node(cpu);
+	}
+
+	pa = memblock_alloc_base_nid(size, align, limit, nid, MEMBLOCK_NONE);
+	if (!pa) {
+		pa = memblock_alloc_base(size, align, limit);
+		if (!pa)
+			panic("cannot allocate paca data");
+	}
+
+	if (cpu == boot_cpuid)
+		memblock_set_bottom_up(false);
+
+	return __va(pa);
+}
+
 #ifdef CONFIG_PPC_PSERIES
 
 /*
@@ -52,7 +83,7 @@ static struct lppaca * __init new_lppaca(int cpu, unsigned long limit)
 	if (early_cpu_has_feature(CPU_FTR_HVMODE))
 		return NULL;
 
-	lp = __va(memblock_alloc_base(size, 0x400, limit));
+	lp = alloc_paca_data(size, 0x400, limit, cpu);
 	init_lppaca(lp);
 
 	return lp;
@@ -82,7 +113,7 @@ static struct slb_shadow * __init new_slb_shadow(int cpu, unsigned long limit)
 			return NULL;
 	}
 
-	s = __va(memblock_alloc_base(sizeof(*s), L1_CACHE_BYTES, limit));
+	s = alloc_paca_data(sizeof(*s), L1_CACHE_BYTES, limit, cpu);
 	memset(s, 0, sizeof(*s));
 
 	s->persistent = cpu_to_be32(SLB_NUM_BOLTED);
@@ -173,7 +204,6 @@ void __init allocate_paca_ptrs(void)
 void __init allocate_paca(int cpu)
 {
 	u64 limit;
-	unsigned long pa;
 	struct paca_struct *paca;
 
 	BUG_ON(cpu >= paca_nr_cpu_ids);
@@ -188,9 +218,8 @@ void __init allocate_paca(int cpu)
 	limit = ppc64_rma_size;
 #endif
 
-	pa = memblock_alloc_base(sizeof(struct paca_struct),
-					L1_CACHE_BYTES, limit);
-	paca = __va(pa);
+	paca = alloc_paca_data(sizeof(struct paca_struct), L1_CACHE_BYTES,
+				limit, cpu);
 	paca_ptrs[cpu] = paca;
 	memset(paca, 0, sizeof(struct paca_struct));
 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index dde34d35d1e7..02fa358982e6 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -312,6 +312,10 @@ void __init early_setup(unsigned long dt_ptr)
 	early_init_devtree(__va(dt_ptr));
 
 	/* Now we know the logical id of our boot cpu, setup the paca. */
+	if (boot_cpuid != 0) {
+		/* Poison paca_ptrs[0] again if it's not the boot cpu */
+		memset(&paca_ptrs[0], 0x88, sizeof(paca_ptrs[0]));
+	}
 	setup_paca(paca_ptrs[boot_cpuid]);
 	fixup_boot_paca();
 
-- 
2.16.1


* [PATCH 11/14] powerpc/64: allocate per-cpu stacks node-local if possible
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (9 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 10/14] powerpc/64: allocate pacas per node Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 12/14] powerpc: pass node id into create_section_mapping Nicholas Piggin
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin
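
Allocate the irq, exception-level, and emergency stacks node-local
where possible, using a common helper. A condensed sketch of
alloc_stack() from the diff:

	pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
				early_cpu_to_node(cpu), MEMBLOCK_NONE);
	if (!pa)	/* fall back to any node */
		pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);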

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/setup_64.c | 51 ++++++++++++++++++++++++++----------------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 02fa358982e6..16ea71fa1ead 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -611,6 +611,21 @@ __init u64 ppc64_bolted_size(void)
 #endif
 }
 
+static void *__init alloc_stack(unsigned long limit, int cpu)
+{
+	unsigned long pa;
+
+	pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
+					early_cpu_to_node(cpu), MEMBLOCK_NONE);
+	if (!pa) {
+		pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);
+		if (!pa)
+			panic("cannot allocate stacks");
+	}
+
+	return __va(pa);
+}
+
 void __init irqstack_early_init(void)
 {
 	u64 limit = ppc64_bolted_size();
@@ -622,12 +637,8 @@ void __init irqstack_early_init(void)
 	 * accessed in realmode.
 	 */
 	for_each_possible_cpu(i) {
-		softirq_ctx[i] = (struct thread_info *)
-			__va(memblock_alloc_base(THREAD_SIZE,
-					    THREAD_SIZE, limit));
-		hardirq_ctx[i] = (struct thread_info *)
-			__va(memblock_alloc_base(THREAD_SIZE,
-					    THREAD_SIZE, limit));
+		softirq_ctx[i] = alloc_stack(limit, i);
+		hardirq_ctx[i] = alloc_stack(limit, i);
 	}
 }
 
@@ -635,20 +646,21 @@ void __init irqstack_early_init(void)
 void __init exc_lvl_early_init(void)
 {
 	unsigned int i;
-	unsigned long sp;
 
 	for_each_possible_cpu(i) {
-		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
-		critirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca_ptrs[i]->crit_kstack = __va(sp + THREAD_SIZE);
+		void *sp;
 
-		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
-		dbgirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca_ptrs[i]->dbg_kstack = __va(sp + THREAD_SIZE);
+		sp = alloc_stack(ULONG_MAX, i);
+		critirq_ctx[i] = sp;
+		paca_ptrs[i]->crit_kstack = sp + THREAD_SIZE;
 
-		sp = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
-		mcheckirq_ctx[i] = (struct thread_info *)__va(sp);
-		paca_ptrs[i]->mc_kstack = __va(sp + THREAD_SIZE);
+		sp = alloc_stack(ULONG_MAX, i);
+		dbgirq_ctx[i] = sp;
+		paca_ptrs[i]->dbg_kstack = sp + THREAD_SIZE;
+
+		sp = alloc_stack(ULONG_MAX, i);
+		mcheckirq_ctx[i] = sp;
+		paca_ptrs[i]->mc_kstack = sp + THREAD_SIZE;
 	}
 
 	if (cpu_has_feature(CPU_FTR_DEBUG_LVL_EXC))
@@ -702,20 +714,21 @@ void __init emergency_stack_init(void)
 
 	for_each_possible_cpu(i) {
 		struct thread_info *ti;
-		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
+
+		ti = alloc_stack(limit, i);
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
 		paca_ptrs[i]->emergency_sp = (void *)ti + THREAD_SIZE;
 
 #ifdef CONFIG_PPC_BOOK3S_64
 		/* emergency stack for NMI exception handling. */
-		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
+		ti = alloc_stack(limit, i);
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
 		paca_ptrs[i]->nmi_emergency_sp = (void *)ti + THREAD_SIZE;
 
 		/* emergency stack for machine check exception handling. */
-		ti = __va(memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit));
+		ti = alloc_stack(limit, i);
 		memset(ti, 0, THREAD_SIZE);
 		emerg_stack_init_thread_info(ti, i);
 		paca_ptrs[i]->mc_emergency_sp = (void *)ti + THREAD_SIZE;
-- 
2.16.1


* [PATCH 12/14] powerpc: pass node id into create_section_mapping
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (10 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 11/14] powerpc/64: allocate per-cpu stacks node-local if possible Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-29  5:51   ` Michael Ellerman
  2018-02-13 15:08 ` [PATCH 13/14] powerpc/64s/radix: split early page table mapping to its own function Nicholas Piggin
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin
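
Pass the node id from arch_add_memory() down through
create_section_mapping(), so that memory hotplug can place page tables
on the node of the memory being added. The new dispatch, condensed
from the diff:

	int create_section_mapping(unsigned long start, unsigned long end, int nid)
	{
		if (radix_enabled())
			return radix__create_section_mapping(start, end, nid);

		return hash__create_section_mapping(start, end, nid);
	}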

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/hash.h  | 2 +-
 arch/powerpc/include/asm/book3s/64/radix.h | 2 +-
 arch/powerpc/include/asm/sparsemem.h       | 2 +-
 arch/powerpc/mm/hash_utils_64.c            | 2 +-
 arch/powerpc/mm/mem.c                      | 4 ++--
 arch/powerpc/mm/pgtable-book3s64.c         | 6 +++---
 arch/powerpc/mm/pgtable-radix.c            | 4 ++--
 7 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 0920eff731b3..b1ace9619e94 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -201,7 +201,7 @@ extern int __meminit hash__vmemmap_create_mapping(unsigned long start,
 extern void hash__vmemmap_remove_mapping(unsigned long start,
 				     unsigned long page_size);
 
-int hash__create_section_mapping(unsigned long start, unsigned long end);
+int hash__create_section_mapping(unsigned long start, unsigned long end, int nid);
 int hash__remove_section_mapping(unsigned long start, unsigned long end);
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 365010f66570..705193e7192f 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -313,7 +313,7 @@ static inline unsigned long radix__get_tree_size(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int radix__create_section_mapping(unsigned long start, unsigned long end);
+int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index a7916ee6dfb6..bc66712bdc3c 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -17,7 +17,7 @@
 #endif /* CONFIG_SPARSEMEM */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-extern int create_section_mapping(unsigned long start, unsigned long end);
+extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 7d07c7e17db6..ceb5494804b2 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -781,7 +781,7 @@ void resize_hpt_for_hotplug(unsigned long new_mem_size)
 	}
 }
 
-int hash__create_section_mapping(unsigned long start, unsigned long end)
+int hash__create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
 	int rc = htab_bolt_mapping(start, end, __pa(start),
 				   pgprot_val(PAGE_KERNEL), mmu_linear_psize,
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 4eee46ea4d96..f50ce66dd6bd 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -117,7 +117,7 @@ int memory_add_physaddr_to_nid(u64 start)
 }
 #endif
 
-int __weak create_section_mapping(unsigned long start, unsigned long end)
+int __weak create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
 	return -ENODEV;
 }
@@ -137,7 +137,7 @@ int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
 	resize_hpt_for_hotplug(memblock_phys_mem_size());
 
 	start = (unsigned long)__va(start);
-	rc = create_section_mapping(start, start + size);
+	rc = create_section_mapping(start, start + size, nid);
 	if (rc) {
 		pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
 			start, start + size, rc);
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 422e80253a33..c736280068ce 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -155,12 +155,12 @@ void mmu_cleanup_all(void)
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-int create_section_mapping(unsigned long start, unsigned long end)
+int create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
 	if (radix_enabled())
-		return radix__create_section_mapping(start, end);
+		return radix__create_section_mapping(start, end, nid);
 
-	return hash__create_section_mapping(start, end);
+	return hash__create_section_mapping(start, end, nid);
 }
 
 int remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 328ff9abc333..435b19e74508 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -862,9 +862,9 @@ static void remove_pagetable(unsigned long start, unsigned long end)
 	radix__flush_tlb_kernel_range(start, end);
 }
 
-int __ref radix__create_section_mapping(unsigned long start, unsigned long end)
+int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
-	return create_physical_mapping(start, end);
+	return create_physical_mapping(start, end, nid);
 }
 
 int radix__remove_section_mapping(unsigned long start, unsigned long end)
-- 
2.16.1


* [PATCH 13/14] powerpc/64s/radix: split early page table mapping to its own function
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (11 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 12/14] powerpc: pass node id into create_section_mapping Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-02-13 15:08 ` [PATCH 14/14] powerpc/64s/radix: allocate kernel page tables node-local if possible Nicholas Piggin
  2018-03-07 10:50 ` [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Michael Ellerman
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin
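
Mechanical split: move the !slab_is_available() path of
radix__map_kernel_page() out into early_map_kernel_page(), so that the
next patch can give the early path allocation hints. The top level
becomes, roughly:

	if (!slab_is_available())
		return early_map_kernel_page(ea, pa, flags, map_page_size);

	/* otherwise pud_alloc()/pmd_alloc()/pte_alloc_kernel() as before */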

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/pgtable-radix.c | 114 +++++++++++++++++++++++-----------------
 1 file changed, 66 insertions(+), 48 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 435b19e74508..4c5cc69c92c2 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -58,6 +58,50 @@ static __ref void *early_alloc_pgtable(unsigned long size)
 	return pt;
 }
 
+static int early_map_kernel_page(unsigned long ea, unsigned long pa,
+			  pgprot_t flags,
+			  unsigned int map_page_size)
+{
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+
+	pgdp = pgd_offset_k(ea);
+	if (pgd_none(*pgdp)) {
+		pudp = early_alloc_pgtable(PUD_TABLE_SIZE);
+		BUG_ON(pudp == NULL);
+		pgd_populate(&init_mm, pgdp, pudp);
+	}
+	pudp = pud_offset(pgdp, ea);
+	if (map_page_size == PUD_SIZE) {
+		ptep = (pte_t *)pudp;
+		goto set_the_pte;
+	}
+	if (pud_none(*pudp)) {
+		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE);
+		BUG_ON(pmdp == NULL);
+		pud_populate(&init_mm, pudp, pmdp);
+	}
+	pmdp = pmd_offset(pudp, ea);
+	if (map_page_size == PMD_SIZE) {
+		ptep = pmdp_ptep(pmdp);
+		goto set_the_pte;
+	}
+	if (!pmd_present(*pmdp)) {
+		ptep = early_alloc_pgtable(PAGE_SIZE);
+		BUG_ON(ptep == NULL);
+		pmd_populate_kernel(&init_mm, pmdp, ptep);
+	}
+	ptep = pte_offset_kernel(pmdp, ea);
+
+set_the_pte:
+	set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, flags));
+	smp_wmb();
+	return 0;
+}
+
+
 int radix__map_kernel_page(unsigned long ea, unsigned long pa,
 			  pgprot_t flags,
 			  unsigned int map_page_size)
@@ -70,54 +114,28 @@ int radix__map_kernel_page(unsigned long ea, unsigned long pa,
 	 * Make sure task size is correct as per the max addr
 	 */
 	BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE);
-	if (slab_is_available()) {
-		pgdp = pgd_offset_k(ea);
-		pudp = pud_alloc(&init_mm, pgdp, ea);
-		if (!pudp)
-			return -ENOMEM;
-		if (map_page_size == PUD_SIZE) {
-			ptep = (pte_t *)pudp;
-			goto set_the_pte;
-		}
-		pmdp = pmd_alloc(&init_mm, pudp, ea);
-		if (!pmdp)
-			return -ENOMEM;
-		if (map_page_size == PMD_SIZE) {
-			ptep = pmdp_ptep(pmdp);
-			goto set_the_pte;
-		}
-		ptep = pte_alloc_kernel(pmdp, ea);
-		if (!ptep)
-			return -ENOMEM;
-	} else {
-		pgdp = pgd_offset_k(ea);
-		if (pgd_none(*pgdp)) {
-			pudp = early_alloc_pgtable(PUD_TABLE_SIZE);
-			BUG_ON(pudp == NULL);
-			pgd_populate(&init_mm, pgdp, pudp);
-		}
-		pudp = pud_offset(pgdp, ea);
-		if (map_page_size == PUD_SIZE) {
-			ptep = (pte_t *)pudp;
-			goto set_the_pte;
-		}
-		if (pud_none(*pudp)) {
-			pmdp = early_alloc_pgtable(PMD_TABLE_SIZE);
-			BUG_ON(pmdp == NULL);
-			pud_populate(&init_mm, pudp, pmdp);
-		}
-		pmdp = pmd_offset(pudp, ea);
-		if (map_page_size == PMD_SIZE) {
-			ptep = pmdp_ptep(pmdp);
-			goto set_the_pte;
-		}
-		if (!pmd_present(*pmdp)) {
-			ptep = early_alloc_pgtable(PAGE_SIZE);
-			BUG_ON(ptep == NULL);
-			pmd_populate_kernel(&init_mm, pmdp, ptep);
-		}
-		ptep = pte_offset_kernel(pmdp, ea);
+
+	if (!slab_is_available())
+		return early_map_kernel_page(ea, pa, flags, map_page_size);
+
+	pgdp = pgd_offset_k(ea);
+	pudp = pud_alloc(&init_mm, pgdp, ea);
+	if (!pudp)
+		return -ENOMEM;
+	if (map_page_size == PUD_SIZE) {
+		ptep = (pte_t *)pudp;
+		goto set_the_pte;
+	}
+	pmdp = pmd_alloc(&init_mm, pudp, ea);
+	if (!pmdp)
+		return -ENOMEM;
+	if (map_page_size == PMD_SIZE) {
+		ptep = pmdp_ptep(pmdp);
+		goto set_the_pte;
 	}
+	ptep = pte_alloc_kernel(pmdp, ea);
+	if (!ptep)
+		return -ENOMEM;
 
 set_the_pte:
 	set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, flags));
@@ -864,7 +882,7 @@ static void remove_pagetable(unsigned long start, unsigned long end)
 
 int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
-	return create_physical_mapping(start, end, nid);
+	return create_physical_mapping(start, end);
 }
 
 int radix__remove_section_mapping(unsigned long start, unsigned long end)
-- 
2.16.1


* [PATCH 14/14] powerpc/64s/radix: allocate kernel page tables node-local if possible
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (12 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 13/14] powerpc/64s/radix: split early page table mapping to its own function Nicholas Piggin
@ 2018-02-13 15:08 ` Nicholas Piggin
  2018-03-07 10:50 ` [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Michael Ellerman
  14 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-02-13 15:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Try to allocate kernel page tables for direct mapping and vmemmap
according to the node of the memory they will map. The node is not
available for the linear map in early boot, so use range allocation
to allocate the page tables from the region they map, which is
effectively node-local.
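
early_alloc_pgtable() applies the hints in order of preference; a
condensed sketch from the diff:

	if (region_start || region_end)		/* region hint */
		pa = memblock_alloc_range(size, size, region_start,
						region_end, MEMBLOCK_NONE);
	else if (nid != -1)			/* node hint */
		pa = memblock_alloc_base_nid(size, size,
						MEMBLOCK_ALLOC_ANYWHERE,
						nid, MEMBLOCK_NONE);
	if (!pa)				/* last resort: anywhere */
		pa = memblock_alloc_base(size, size, MEMBLOCK_ALLOC_ANYWHERE);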

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/pgtable-radix.c | 111 ++++++++++++++++++++++++++++++----------
 1 file changed, 85 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 4c5cc69c92c2..66b07718875a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -48,11 +48,26 @@ static int native_register_process_table(unsigned long base, unsigned long pg_sz
 	return 0;
 }
 
-static __ref void *early_alloc_pgtable(unsigned long size)
+static __ref void *early_alloc_pgtable(unsigned long size, int nid,
+			unsigned long region_start, unsigned long region_end)
 {
+	unsigned long pa = 0;
 	void *pt;
 
-	pt = __va(memblock_alloc_base(size, size, MEMBLOCK_ALLOC_ANYWHERE));
+	if (region_start || region_end) /* has region hint */
+		pa = memblock_alloc_range(size, size, region_start, region_end,
+						MEMBLOCK_NONE);
+	else if (nid != -1) /* has node hint */
+		pa = memblock_alloc_base_nid(size, size,
+						MEMBLOCK_ALLOC_ANYWHERE,
+						nid, MEMBLOCK_NONE);
+
+	if (!pa)
+		pa = memblock_alloc_base(size, size, MEMBLOCK_ALLOC_ANYWHERE);
+
+	BUG_ON(!pa);
+
+	pt = __va(pa);
 	memset(pt, 0, size);
 
 	return pt;
@@ -60,8 +75,11 @@ static __ref void *early_alloc_pgtable(unsigned long size)
 
 static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 			  pgprot_t flags,
-			  unsigned int map_page_size)
+			  unsigned int map_page_size,
+			  int nid,
+			  unsigned long region_start, unsigned long region_end)
 {
+	unsigned long pfn = pa >> PAGE_SHIFT;
 	pgd_t *pgdp;
 	pud_t *pudp;
 	pmd_t *pmdp;
@@ -69,8 +87,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 
 	pgdp = pgd_offset_k(ea);
 	if (pgd_none(*pgdp)) {
-		pudp = early_alloc_pgtable(PUD_TABLE_SIZE);
-		BUG_ON(pudp == NULL);
+		pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
+						region_start, region_end);
 		pgd_populate(&init_mm, pgdp, pudp);
 	}
 	pudp = pud_offset(pgdp, ea);
@@ -79,8 +97,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 		goto set_the_pte;
 	}
 	if (pud_none(*pudp)) {
-		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE);
-		BUG_ON(pmdp == NULL);
+		pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
+						region_start, region_end);
 		pud_populate(&init_mm, pudp, pmdp);
 	}
 	pmdp = pmd_offset(pudp, ea);
@@ -89,23 +107,29 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 		goto set_the_pte;
 	}
 	if (!pmd_present(*pmdp)) {
-		ptep = early_alloc_pgtable(PAGE_SIZE);
-		BUG_ON(ptep == NULL);
+		ptep = early_alloc_pgtable(PAGE_SIZE, nid,
+						region_start, region_end);
 		pmd_populate_kernel(&init_mm, pmdp, ptep);
 	}
 	ptep = pte_offset_kernel(pmdp, ea);
 
 set_the_pte:
-	set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, flags));
+	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
 	smp_wmb();
 	return 0;
 }
 
-
-int radix__map_kernel_page(unsigned long ea, unsigned long pa,
+/*
+ * nid, region_start, and region_end are hints to try to place the page
+ * table memory in the same node or region.
+ */
+static int __map_kernel_page(unsigned long ea, unsigned long pa,
 			  pgprot_t flags,
-			  unsigned int map_page_size)
+			  unsigned int map_page_size,
+			  int nid,
+			  unsigned long region_start, unsigned long region_end)
 {
+	unsigned long pfn = pa >> PAGE_SHIFT;
 	pgd_t *pgdp;
 	pud_t *pudp;
 	pmd_t *pmdp;
@@ -115,9 +139,15 @@ int radix__map_kernel_page(unsigned long ea, unsigned long pa,
 	 */
 	BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE);
 
-	if (!slab_is_available())
-		return early_map_kernel_page(ea, pa, flags, map_page_size);
+	if (unlikely(!slab_is_available()))
+		return early_map_kernel_page(ea, pa, flags, map_page_size,
+						nid, region_start, region_end);
 
+	/*
+	 * Should make page table allocation functions be able to take a
+	 * node, so we can place kernel page tables on the right nodes after
+	 * boot.
+	 */
 	pgdp = pgd_offset_k(ea);
 	pudp = pud_alloc(&init_mm, pgdp, ea);
 	if (!pudp)
@@ -138,11 +168,25 @@ int radix__map_kernel_page(unsigned long ea, unsigned long pa,
 		return -ENOMEM;
 
 set_the_pte:
-	set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, flags));
+	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
 	smp_wmb();
 	return 0;
 }
 
+static int __map_kernel_page_nid(unsigned long ea, unsigned long pa,
+			  pgprot_t flags,
+			  unsigned int map_page_size, int nid)
+{
+	return __map_kernel_page(ea, pa, flags, map_page_size, nid, 0, 0);
+}
+
+int radix__map_kernel_page(unsigned long ea, unsigned long pa,
+			  pgprot_t flags,
+			  unsigned int map_page_size)
+{
+	return __map_kernel_page(ea, pa, flags, map_page_size, -1, 0, 0);
+}
+
 #ifdef CONFIG_STRICT_KERNEL_RWX
 void radix__change_memory_range(unsigned long start, unsigned long end,
 				unsigned long clear)
@@ -229,7 +273,8 @@ static inline void __meminit print_mapping(unsigned long start,
 }
 
 static int __meminit create_physical_mapping(unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     int nid)
 {
 	unsigned long vaddr, addr, mapping_size = 0;
 	pgprot_t prot;
@@ -285,7 +330,7 @@ static int __meminit create_physical_mapping(unsigned long start,
 		else
 			prot = PAGE_KERNEL;
 
-		rc = radix__map_kernel_page(vaddr, addr, prot, mapping_size);
+		rc = __map_kernel_page(vaddr, addr, prot, mapping_size, nid, start, end);
 		if (rc)
 			return rc;
 	}
@@ -294,7 +339,7 @@ static int __meminit create_physical_mapping(unsigned long start,
 	return 0;
 }
 
-static void __init radix_init_pgtable(void)
+void __init radix_init_pgtable(void)
 {
 	unsigned long rts_field;
 	struct memblock_region *reg;
@@ -304,9 +349,16 @@ static void __init radix_init_pgtable(void)
 	/*
 	 * Create the linear mapping, using standard page size for now
 	 */
-	for_each_memblock(memory, reg)
+	for_each_memblock(memory, reg) {
+		/*
+		 * The memblock allocator is up at this point, so the
+		 * page tables will be allocated within the range. No
+		 * need for a node (which we don't have yet).
+		 */
 		WARN_ON(create_physical_mapping(reg->base,
-						reg->base + reg->size));
+						reg->base + reg->size,
+						-1));
+	}
 
 	/* Find out how many PID bits are supported */
 	if (cpu_has_feature(CPU_FTR_HVMODE)) {
@@ -335,7 +387,7 @@ static void __init radix_init_pgtable(void)
 	 * host.
 	 */
 	BUG_ON(PRTB_SIZE_SHIFT > 36);
-	process_tb = early_alloc_pgtable(1UL << PRTB_SIZE_SHIFT);
+	process_tb = early_alloc_pgtable(1UL << PRTB_SIZE_SHIFT, -1, 0, 0);
 	/*
 	 * Fill in the process table.
 	 */
@@ -716,14 +768,17 @@ static int stop_machine_change_mapping(void *data)
 {
 	struct change_mapping_params *params =
 			(struct change_mapping_params *)data;
+	int nid;
 
 	if (!data)
 		return -1;
 
+	nid = pfn_to_nid(params->start >> PAGE_SHIFT);
+
 	spin_unlock(&init_mm.page_table_lock);
 	pte_clear(&init_mm, params->aligned_start, params->pte);
-	create_physical_mapping(params->aligned_start, params->start);
-	create_physical_mapping(params->end, params->aligned_end);
+	create_physical_mapping(params->aligned_start, params->start, nid);
+	create_physical_mapping(params->end, params->aligned_end, nid);
 	spin_lock(&init_mm.page_table_lock);
 	return 0;
 }
@@ -882,7 +937,7 @@ static void remove_pagetable(unsigned long start, unsigned long end)
 
 int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
 {
-	return create_physical_mapping(start, end);
+	return create_physical_mapping(start, end, nid);
 }
 
 int radix__remove_section_mapping(unsigned long start, unsigned long end)
@@ -899,8 +954,12 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 {
 	/* Create a PTE encoding */
 	unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW;
+	int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
+	int ret;
+
+	ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, nid);
+	BUG_ON(ret);
 
-	BUG_ON(radix__map_kernel_page(start, phys, __pgprot(flags), page_size));
 	return 0;
 }
 
-- 
2.16.1


* Re: [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
  2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
                   ` (13 preceding siblings ...)
  2018-02-13 15:08 ` [PATCH 14/14] powerpc/64s/radix: allocate kernel page tables node-local if possible Nicholas Piggin
@ 2018-03-07 10:50 ` Michael Ellerman
  2018-03-07 11:23   ` Nicholas Piggin
  2018-03-08  2:04   ` Nicholas Piggin
  14 siblings, 2 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-07 10:50 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> This series allows numa aware allocations for various early data
> structures for radix. Hash still has a bolted SLB limitation that
> prevents at least pacas and stacks from node-affine allocations.
>
> Fixed up a number of bugs, got pSeries working, added a couple more
> cases where page tables can be allocated node-local.

Few problems in here:

FAILURE kernel-build-linux » powerpc,gcc_ubuntu_be,pmac32
  arch/powerpc/kernel/prom.c:748:2: error: implicit declaration of function 'allocate_paca_ptrs' [-Werror=implicit-function-declaration]

FAILURE kernel-build-linux » powerpc,gcc_ubuntu_le,powernv
  arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
  arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'

Did I miss a follow-up or something?

cheers


* Re: [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
  2018-03-07 10:50 ` [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Michael Ellerman
@ 2018-03-07 11:23   ` Nicholas Piggin
  2018-03-08  2:04   ` Nicholas Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-07 11:23 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Wed, 07 Mar 2018 21:50:04 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > This series allows numa aware allocations for various early data
> > structures for radix. Hash still has a bolted SLB limitation that
> > prevents at least pacas and stacks from node-affine allocations.
> >
> > Fixed up a number of bugs, got pSeries working, added a couple more
> > cases where page tables can be allocated node-local.  
> 
> Few problems in here:
> 
> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_be,pmac32
>   arch/powerpc/kernel/prom.c:748:2: error: implicit declaration of function 'allocate_paca_ptrs' [-Werror=implicit-function-declaration]
> 
> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_le,powernv
>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
> 
> Did I miss a follow-up or something?

No, I probably just don't do enough compile testing on ppc32. Not
sure about the powernv error, probably just missed testing a config.
Do you have more logs?

Thanks,
Nick


* Re: [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
  2018-03-07 10:50 ` [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Michael Ellerman
  2018-03-07 11:23   ` Nicholas Piggin
@ 2018-03-08  2:04   ` Nicholas Piggin
  2018-03-29  6:18     ` Michael Ellerman
  1 sibling, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-08  2:04 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Wed, 07 Mar 2018 21:50:04 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > This series allows numa aware allocations for various early data
> > structures for radix. Hash still has a bolted SLB limitation that
> > prevents at least pacas and stacks from node-affine allocations.
> >
> > Fixed up a number of bugs, got pSeries working, added a couple more
> > cases where page tables can be allocated node-local.  
> 
> Few problems in here:
> 
> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_be,pmac32
>   arch/powerpc/kernel/prom.c:748:2: error: implicit declaration of function 'allocate_paca_ptrs' [-Werror=implicit-function-declaration]
> 
> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_le,powernv
>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
> 
> Did I miss a follow-up or something?

Here's a patch that applies to "powerpc/64: defer paca allocation
until memory topology is discovered". The first hunk fixes the ppc32
issue, and the second hunk avoids freeing the cpu_to_phys_id array
if the platform didn't allocate it. But I've just realized that
should go into the previous patch (which is missing the
memblock_free).
--

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 8aaa697701f1..fff29b8057d9 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -258,7 +258,8 @@ extern void free_unused_pacas(void);
 
 #else /* CONFIG_PPC64 */
 
-static inline void allocate_pacas(void) { };
+static inline void allocate_paca_ptrs(void) { };
+static inline void allocate_paca(int cpu) { };
 static inline void free_unused_pacas(void) { };
 
 #endif /* CONFIG_PPC64 */
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 56f7a2b793e0..2ba05acc2973 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -854,8 +854,10 @@ static void smp_setup_pacas(void)
 		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
 	}
 
-	memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
-	cpu_to_phys_id = NULL;
+	if (cpu_to_phys_id) {
+		memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
+		cpu_to_phys_id = NULL;
+	}
 }
 #endif
 

* OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)
  2018-02-13 15:08 ` [PATCH 05/14] mm: make memblock_alloc_base_nid non-static Nicholas Piggin
@ 2018-03-13 12:06     ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-13 12:06 UTC (permalink / raw)
  To: akpm, mhocko, catalin.marinas, pasha.tatashin, takahiro.akashi,
	gi-oh.kim, npiggin, baiyaowei, bob.picco, ard.biesheuvel,
	linux-mm, linux-kernel
  Cc: Nicholas Piggin, linuxppc-dev

Anyone object to us merging the following patch via the powerpc tree?

Full series is here if anyone's interested:
  http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*

cheers

Nicholas Piggin <npiggin@gmail.com> writes:
> This will be used by powerpc to allocate per-cpu stacks and other
> data structures node-local where possible.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  include/linux/memblock.h | 5 ++++-
>  mm/memblock.c            | 2 +-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 8be5077efb5f..8cab51398705 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -316,9 +316,12 @@ static inline bool memblock_bottom_up(void)
>  #define MEMBLOCK_ALLOC_ANYWHERE	(~(phys_addr_t)0)
>  #define MEMBLOCK_ALLOC_ACCESSIBLE	0
>  
> -phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
> +phys_addr_t memblock_alloc_range(phys_addr_t size, phys_addr_t align,
>  					phys_addr_t start, phys_addr_t end,
>  					ulong flags);
> +phys_addr_t memblock_alloc_base_nid(phys_addr_t size,
> +					phys_addr_t align, phys_addr_t max_addr,
> +					int nid, ulong flags);
>  phys_addr_t memblock_alloc_base(phys_addr_t size, phys_addr_t align,
>  				phys_addr_t max_addr);
>  phys_addr_t __memblock_alloc_base(phys_addr_t size, phys_addr_t align,
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5a9ca2a1751b..cea2af494da0 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1190,7 +1190,7 @@ phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
>  					flags);
>  }
>  
> -static phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
> +phys_addr_t __init memblock_alloc_base_nid(phys_addr_t size,
>  					phys_addr_t align, phys_addr_t max_addr,
>  					int nid, ulong flags)
>  {
> -- 
> 2.16.1
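
For context, the sort of caller this export enables looks roughly like the
per-cpu stack allocation later in the series (a sketch; the limit and
fallback handling are simplified from the actual patches):

static void *__init alloc_stack(unsigned long limit, int cpu)
{
	unsigned long pa;

	/* Prefer the CPU's home node, fall back to any node below limit. */
	pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
				     early_cpu_to_node(cpu), MEMBLOCK_NONE);
	if (!pa) {
		pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);
		if (!pa)
			panic("cannot allocate stacks");
	}

	return __va(pa);
}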

* Re: [PATCH 07/14] powerpc/64: move default SPR recording
  2018-02-13 15:08 ` [PATCH 07/14] powerpc/64: move default SPR recording Nicholas Piggin
@ 2018-03-13 12:25   ` Michael Ellerman
  2018-03-13 12:55     ` Nicholas Piggin
  2018-03-13 15:47     ` Nicholas Piggin
  0 siblings, 2 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-13 12:25 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> Move this into the early setup code, and don't iterate over CPU masks.
> We don't want to call into sysfs so early from setup, and a future patch
> won't initialize CPU masks by the time this is called.
> ---
>  arch/powerpc/kernel/paca.c     |  3 +++
>  arch/powerpc/kernel/setup.h    |  9 +++------
>  arch/powerpc/kernel/setup_64.c |  8 ++++++++
>  arch/powerpc/kernel/sysfs.c    | 18 +++++++-----------
>  4 files changed, 21 insertions(+), 17 deletions(-)

This patch, and 8, 9, 10, aren't signed-off by you.

I'll assume you just forgot and add it.

cheers

* Re: [PATCH 03/14] powerpc/64s: allocate lppacas individually
  2018-02-13 15:08 ` [PATCH 03/14] powerpc/64s: allocate lppacas individually Nicholas Piggin
@ 2018-03-13 12:41   ` Michael Ellerman
  2018-03-13 12:54     ` Nicholas Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Michael Ellerman @ 2018-03-13 12:41 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
> index eeb13429d685..3fe126796975 100644
> --- a/arch/powerpc/platforms/pseries/kexec.c
> +++ b/arch/powerpc/platforms/pseries/kexec.c
> @@ -23,7 +23,12 @@
>  
>  void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
>  {
> -	/* Don't risk a hypervisor call if we're crashing */
> +	/*
> +	 * Don't risk a hypervisor call if we're crashing
> +	 * XXX: Why? The hypervisor is not crashing. It might be better
> +	 * to at least attempt unregister to avoid the hypervisor stepping
> +	 * on our memory.
> +	 */

Because every extra line of code we run in the crashed kernel is another
opportunity to screw up and not make it into the kdump kernel.

For example the hcalls we do to unregister the VPA might trigger hcall
tracing which runs a bunch of code and might trip up on something. We
could modify those hcalls to not be traced, but then we can't trace them
in normal operation.

And the hypervisor might continue to write to the VPA, but that's OK
because it's the VPA of the crashing kernel, the kdump kernel runs in a
separate reserved memory region.

Possibly we could fix the hcall tracing issues etc, but this code has
not given us any problems for quite a while (~13 years) - ie. there
seems to be no issue with re-registering the VPAs etc. in the kdump
kernel.
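
To make that concrete, the guard under discussion looks roughly like this
(a simplified sketch, not the exact pseries code; the helper names are
illustrative and error handling is elided):

void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
{
	/*
	 * Only unregister the VPA with the hypervisor on an orderly
	 * kexec. On a crash, skip the hcalls entirely so no extra
	 * code (e.g. hcall tracing) can run and stop us reaching
	 * the kdump kernel.
	 */
	if (firmware_has_feature(FW_FEATURE_SPLPAR) && !crash_shutdown)
		unregister_vpa(hard_smp_processor_id());

	/* minimal, hcall-free interrupt teardown continues here */
}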

cheers

* Re: [PATCH 03/14] powerpc/64s: allocate lppacas individually
  2018-03-13 12:41   ` Michael Ellerman
@ 2018-03-13 12:54     ` Nicholas Piggin
  2018-03-16 14:16       ` Michael Ellerman
  0 siblings, 1 reply; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-13 12:54 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Tue, 13 Mar 2018 23:41:46 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
> > index eeb13429d685..3fe126796975 100644
> > --- a/arch/powerpc/platforms/pseries/kexec.c
> > +++ b/arch/powerpc/platforms/pseries/kexec.c
> > @@ -23,7 +23,12 @@
> >  
> >  void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
> >  {
> > -	/* Don't risk a hypervisor call if we're crashing */
> > +	/*
> > +	 * Don't risk a hypervisor call if we're crashing
> > +	 * XXX: Why? The hypervisor is not crashing. It might be better
> > +	 * to at least attempt unregister to avoid the hypervisor stepping
> > +	 * on our memory.
> > +	 */  
> 
> Because every extra line of code we run in the crashed kernel is another
> opportunity to screw up and not make it into the kdump kernel.
> 
> For example the hcalls we do to unregister the VPA might trigger hcall
> tracing which runs a bunch of code and might trip up on something. We
> could modify those hcalls to not be traced, but then we can't trace them
> in normal operation.

We really make no other hcalls in a crash? I didn't think of that.

> 
> And the hypervisor might continue to write to the VPA, but that's OK
> because it's the VPA of the crashing kernel, the kdump kernel runs in a
> separate reserved memory region.

Well that takes care of that concern.

> Possibly we could fix the hcall tracing issues etc, but this code has
> not given us any problems for quite a while (~13 years) - ie. there
> seems to be no issue with re-registering the VPAs etc. in the kdump
> kernel.

No, I think it's okay then, if you could drop that hunk...

Thanks,
Nick

* Re: [PATCH 07/14] powerpc/64: move default SPR recording
  2018-03-13 12:25   ` Michael Ellerman
@ 2018-03-13 12:55     ` Nicholas Piggin
  2018-03-13 15:47     ` Nicholas Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-13 12:55 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Tue, 13 Mar 2018 23:25:05 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > Move this into the early setup code, and don't iterate over CPU masks.
> > We don't want to call into sysfs so early from setup, and a future patch
> > won't initialize CPU masks by the time this is called.
> > ---
> >  arch/powerpc/kernel/paca.c     |  3 +++
> >  arch/powerpc/kernel/setup.h    |  9 +++------
> >  arch/powerpc/kernel/setup_64.c |  8 ++++++++
> >  arch/powerpc/kernel/sysfs.c    | 18 +++++++-----------
> >  4 files changed, 21 insertions(+), 17 deletions(-)  
> 
> This patch, and 8, 9, 10, aren't signed-off by you.
> 
> I'll assume you just forgot and add it.

Yes I did.

Thanks,
Nick

* Re: [PATCH 07/14] powerpc/64: move default SPR recording
  2018-03-13 12:25   ` Michael Ellerman
  2018-03-13 12:55     ` Nicholas Piggin
@ 2018-03-13 15:47     ` Nicholas Piggin
  1 sibling, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-13 15:47 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Tue, 13 Mar 2018 23:25:05 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > Move this into the early setup code, and don't iterate over CPU masks.
> > We don't want to call into sysfs so early from setup, and a future patch
> > won't initialize CPU masks by the time this is called.
> > ---
> >  arch/powerpc/kernel/paca.c     |  3 +++
> >  arch/powerpc/kernel/setup.h    |  9 +++------
> >  arch/powerpc/kernel/setup_64.c |  8 ++++++++
> >  arch/powerpc/kernel/sysfs.c    | 18 +++++++-----------
> >  4 files changed, 21 insertions(+), 17 deletions(-)  
> 
> This patch, and 8, 9, 10, aren't signed-off by you.
> 
> I'll assume you just forgot and add it.

Can I give you an incremental fix for this patch? dscr_default
is zero at this point, so set it from spr_default_dscr before
setting pacas.

Remove the assignment from initialise_paca -- this happens too
early and gets overwritten anyway.

---
 arch/powerpc/kernel/paca.c  | 3 ---
 arch/powerpc/kernel/sysfs.c | 2 ++
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 2f3187501a36..7736188c764f 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -133,9 +133,6 @@ void __init initialise_paca(struct paca_struct *new_paca, int cpu)
 	new_paca->kexec_state = KEXEC_STATE_NONE;
 	new_paca->__current = &init_task;
 	new_paca->data_offset = 0xfeeeeeeeeeeeeeeeULL;
-#ifdef CONFIG_PPC64
-	new_paca->dscr_default = spr_default_dscr;
-#endif
 #ifdef CONFIG_PPC_BOOK3S_64
 	new_paca->slb_shadow_ptr = NULL;
 #endif
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index aaab582a640c..755dc98a57ae 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -20,6 +20,7 @@
 #include <asm/firmware.h>
 
 #include "cacheinfo.h"
+#include "setup.h"
 
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
@@ -592,6 +593,7 @@ static void sysfs_create_dscr_default(void)
 		int err = 0;
 		int cpu;
 
+		dscr_default = spr_default_dscr;
 		for_each_possible_cpu(cpu)
 			paca_ptrs[cpu]->dscr_default = dscr_default;
 
-- 
2.16.1

* Re: OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)
  2018-03-13 12:06     ` Michael Ellerman
  (?)
@ 2018-03-13 19:41     ` Andrew Morton
  2018-03-14  0:56       ` Nicholas Piggin
  -1 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2018-03-13 19:41 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: mhocko, catalin.marinas, pasha.tatashin, takahiro.akashi,
	gi-oh.kim, npiggin, baiyaowei, bob.picco, ard.biesheuvel,
	linux-mm, linux-kernel, linuxppc-dev

On Tue, 13 Mar 2018 23:06:35 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:

> Anyone object to us merging the following patch via the powerpc tree?
> 
> Full series is here if anyone's interested:
>   http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*
> 

Yup, please go ahead.

I assume the change to the memblock_alloc_range() declaration was an
unrelated, unchangelogged cleanup.

* Re: OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static)
  2018-03-13 19:41     ` Andrew Morton
@ 2018-03-14  0:56       ` Nicholas Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-14  0:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michael Ellerman, mhocko, catalin.marinas, pasha.tatashin,
	takahiro.akashi, gi-oh.kim, baiyaowei, bob.picco, ard.biesheuvel,
	linux-mm, linux-kernel, linuxppc-dev

On Tue, 13 Mar 2018 12:41:28 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 13 Mar 2018 23:06:35 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
> 
> > Anyone object to us merging the following patch via the powerpc tree?
> > 
> > Full series is here if anyone's interested:
> >   http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=28377&state=*
> >   
> 
> Yup, please go ahead.
> 
> I assume the change to the memblock_alloc_range() declaration was an
> unrelated, unchangelogged cleanup.
> 

It is. I'm trying to get better at that. Michael might drop that bit if
he's not already sick of fixing up my patches...

Thanks,
Nick

* Re: [PATCH 03/14] powerpc/64s: allocate lppacas individually
  2018-03-13 12:54     ` Nicholas Piggin
@ 2018-03-16 14:16       ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-16 14:16 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev

Nicholas Piggin <npiggin@gmail.com> writes:
> On Tue, 13 Mar 2018 23:41:46 +1100
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> > diff --git a/arch/powerpc/platforms/pseries/kexec.c b/arch/powerpc/platforms/pseries/kexec.c
>> > index eeb13429d685..3fe126796975 100644
>> > --- a/arch/powerpc/platforms/pseries/kexec.c
>> > +++ b/arch/powerpc/platforms/pseries/kexec.c
>> > @@ -23,7 +23,12 @@
>> >  
>> >  void pseries_kexec_cpu_down(int crash_shutdown, int secondary)
>> >  {
>> > -	/* Don't risk a hypervisor call if we're crashing */
>> > +	/*
>> > +	 * Don't risk a hypervisor call if we're crashing
>> > +	 * XXX: Why? The hypervisor is not crashing. It might be better
>> > +	 * to at least attempt unregister to avoid the hypervisor stepping
>> > +	 * on our memory.
>> > +	 */  
>> 
>> Because every extra line of code we run in the crashed kernel is another
>> opportunity to screw up and not make it into the kdump kernel.
>> 
>> For example the hcalls we do to unregister the VPA might trigger hcall
>> tracing which runs a bunch of code and might trip up on something. We
>> could modify those hcalls to not be traced, but then we can't trace them
>> in normal operation.
>
> We really make no other hcalls in a crash? I didn't think of that.

We do, but they're explicitly written to use plpar_hcall_raw().

And TBH I haven't tested a kdump with hcall tracing enabled lately, so
for all I know it's broken, but that's the theory at least.
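
Roughly, the difference is this (a conceptual sketch with simplified
signatures; the real entry points are in assembly, and __hcall below is an
illustrative stand-in):

/* Traced wrapper: runs tracepoint code around the hypervisor call,
 * which is exactly what the crash path must avoid. */
long plpar_hcall(unsigned long opcode, unsigned long *retbuf)
{
	long rc;

	trace_hcall_entry(opcode);
	rc = __hcall(opcode, retbuf);
	trace_hcall_exit(opcode, rc);
	return rc;
}

/* Raw variant: straight to the hypervisor, nothing extra to trip on. */
long plpar_hcall_raw(unsigned long opcode, unsigned long *retbuf)
{
	return __hcall(opcode, retbuf);
}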

cheers

* Re: [PATCH 10/14] powerpc/64: allocate pacas per node
  2018-02-13 15:08 ` [PATCH 10/14] powerpc/64: allocate pacas per node Nicholas Piggin
@ 2018-03-29  5:50   ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-29  5:50 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> Per-node allocations are possible on 64s with radix that does
> not have the bolted SLB limitation.
>
> Hash would be able to do the same if all CPUs had the bottom of
> their node-local memory bolted as well. This is left as an
> exercise for the reader.
> ---
>  arch/powerpc/kernel/paca.c     | 41 +++++++++++++++++++++++++++++++++++------
>  arch/powerpc/kernel/setup_64.c |  4 ++++
>  2 files changed, 39 insertions(+), 6 deletions(-)

Added SOB.

cheers
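
For readers following the archive, the heart of the per-node placement is a
fallback pattern like this (a sketch simplified from the patch; the
boot-CPU handling is a paraphrase, not the exact code):

static void *__init alloc_paca_data(unsigned long size, unsigned long align,
				    unsigned long limit, int cpu)
{
	unsigned long pa;
	int nid;

	/*
	 * The boot CPU's paca is allocated before the NUMA topology is
	 * known, so it cannot be placed node-locally.
	 */
	if (cpu == boot_cpuid)
		nid = 0;
	else
		nid = early_cpu_to_node(cpu);

	pa = memblock_alloc_base_nid(size, align, limit, nid, MEMBLOCK_NONE);
	if (!pa) {
		pa = memblock_alloc_base(size, align, limit);
		if (!pa)
			panic("cannot allocate paca data");
	}

	return __va(pa);
}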

* Re: [PATCH 08/14] powerpc/setup: cpu_to_phys_id array
  2018-02-13 15:08 ` [PATCH 08/14] powerpc/setup: cpu_to_phys_id array Nicholas Piggin
@ 2018-03-29  5:51   ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-29  5:51 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> Build an array that finds hardware CPU number from logical CPU
> number in firmware CPU discovery. Use that rather than setting
> paca of other CPUs directly, to begin with. Subsequent patch will
> not have pacas allocated at this point.
> ---
>  arch/powerpc/include/asm/smp.h     |  1 +
>  arch/powerpc/kernel/prom.c         |  7 +++++++
>  arch/powerpc/kernel/setup-common.c | 15 ++++++++++++++-
>  3 files changed, 22 insertions(+), 1 deletion(-)

Added SOB.

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 4dffef947b8a..5979e34ba90e 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -874,5 +874,12 @@ EXPORT_SYMBOL(cpu_to_chip_id);
>  
>  bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
>  {
> +	/*
> +	 * Early firmware scanning must use this rather than
> +	 * get_hard_smp_processor_id because we don't have pacas allocated
> +	 * until memory topology is discovered.
> +	 */
> +	if (cpu_to_phys_id != NULL)
> +		return (int)phys_id == cpu_to_phys_id[cpu];

This needed an #ifdef CONFIG_SMP.
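
That is, something along these lines (a sketch of the guarded version,
assuming cpu_to_phys_id is only built for CONFIG_SMP):

bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
{
#ifdef CONFIG_SMP
	/*
	 * Early firmware scanning must use this rather than
	 * get_hard_smp_processor_id because we don't have pacas allocated
	 * until memory topology is discovered.
	 */
	if (cpu_to_phys_id != NULL)
		return (int)phys_id == cpu_to_phys_id[cpu];
#endif
	return (int)phys_id == get_hard_smp_processor_id(cpu);
}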

cheers

* Re: [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered
  2018-02-13 15:08 ` [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered Nicholas Piggin
@ 2018-03-29  5:51   ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-29  5:51 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:
> ---
>  arch/powerpc/include/asm/paca.h    |  3 +-
>  arch/powerpc/kernel/paca.c         | 90 ++++++++++++--------------------------
>  arch/powerpc/kernel/prom.c         |  5 ++-
>  arch/powerpc/kernel/setup-common.c | 24 +++++++---
>  4 files changed, 51 insertions(+), 71 deletions(-)

Added SOB.

cheers

* Re: [PATCH 12/14] powerpc: pass node id into create_section_mapping
  2018-02-13 15:08 ` [PATCH 12/14] powerpc: pass node id into create_section_mapping Nicholas Piggin
@ 2018-03-29  5:51   ` Michael Ellerman
  2018-03-29 15:15     ` Nicholas Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Michael Ellerman @ 2018-03-29  5:51 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:

> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 328ff9abc333..435b19e74508 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -862,9 +862,9 @@ static void remove_pagetable(unsigned long start, unsigned long end)
>  	radix__flush_tlb_kernel_range(start, end);
>  }
>  
> -int __ref radix__create_section_mapping(unsigned long start, unsigned long end)
> +int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
>  {
> -	return create_physical_mapping(start, end);
> +	return create_physical_mapping(start, end, nid);
>  }

This got a little muddled. We add the nid argument here, but
create_physical_mapping() doesn't take it until patch 14.

I managed to fix it by rearranging the last three patches and fiddling
things a bit. If you can check the result once I push, that would be good.
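
In other words, the nid parameter has to be threaded through both layers in
the same patch, roughly (a sketch of the net change):

-static int create_physical_mapping(unsigned long start, unsigned long end)
+static int create_physical_mapping(unsigned long start, unsigned long end,
+				   int nid)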

cheers

* Re: [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
  2018-03-08  2:04   ` Nicholas Piggin
@ 2018-03-29  6:18     ` Michael Ellerman
  2018-03-29 12:04       ` Nicholas Piggin
  0 siblings, 1 reply; 36+ messages in thread
From: Michael Ellerman @ 2018-03-29  6:18 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev

Nicholas Piggin <npiggin@gmail.com> writes:
> On Wed, 07 Mar 2018 21:50:04 +1100
> Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>> > This series allows numa aware allocations for various early data
>> > structures for radix. Hash still has a bolted SLB limitation that
>> > prevents at least pacas and stacks from node-affine allocations.
>> >
>> > Fixed up a number of bugs, got pSeries working, added a couple more
>> > cases where page tables can be allocated node-local.
>> 
>> Few problems in here:
>> 
>> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_be,pmac32
>>   arch/powerpc/kernel/prom.c:748:2: error: implicit declaration of function 'allocate_paca_ptrs' [-Werror=implicit-function-declaration]
>> 
>> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_le,powernv
>>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
>>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
>> 
>> Did I miss a follow-up or something?
>
> Here's a patch that applies to "powerpc/64: defer paca allocation
> until memory topology is discovered". The first hunk fixes the ppc32
> issue, and the second hunk avoids freeing the cpu_to_phys_id array
> if the platform didn't allocate it. But I've just realized that
> should go into the previous patch (which is missing the
> memblock_free).
> --
...
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index 56f7a2b793e0..2ba05acc2973 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -854,8 +854,10 @@ static void smp_setup_pacas(void)
>  		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
>  	}
>  
> -	memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
> -	cpu_to_phys_id = NULL;
> +	if (cpu_to_phys_id) {
> +		memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
> +		cpu_to_phys_id = NULL;
> +	}
>  }
>  #endif

Where did you want that?

cheers

* Re: [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables
  2018-03-29  6:18     ` Michael Ellerman
@ 2018-03-29 12:04       ` Nicholas Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-29 12:04 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Thu, 29 Mar 2018 17:18:12 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> > On Wed, 07 Mar 2018 21:50:04 +1100
> > Michael Ellerman <mpe@ellerman.id.au> wrote:  
> >> Nicholas Piggin <npiggin@gmail.com> writes:  
> >> > This series allows numa aware allocations for various early data
> >> > structures for radix. Hash still has a bolted SLB limitation that
> >> > prevents at least pacas and stacks from node-affine allocations.
> >> >
> >> > Fixed up a number of bugs, got pSeries working, added a couple more
> >> > cases where page tables can be allocated node-local.    
> >> 
> >> Few problems in here:
> >> 
> >> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_be,pmac32
> >>   arch/powerpc/kernel/prom.c:748:2: error: implicit declaration of function 'allocate_paca_ptrs' [-Werror=implicit-function-declaration]
> >> 
> >> FAILURE kernel-build-linux » powerpc,gcc_ubuntu_le,powernv
> >>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
> >>   arch/powerpc/include/asm/paca.h:49:33: error: 'struct paca_struct' has no member named 'lppaca_ptr'
> >> 
> >> Did I miss a follow-up or something?  
> >
> > Here's a patch that applies to "powerpc/64: defer paca allocation
> > until memory topology is discovered". The first hunk fixes the ppc32
> > issue, and the second hunk avoids freeing the cpu_to_phys_id array
> > if the platform didn't allocate it. But I've just realized that
> > should go into the previous patch (which is missing the
> > memblock_free).
> > --  
> ...
> > diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> > index 56f7a2b793e0..2ba05acc2973 100644
> > --- a/arch/powerpc/kernel/setup-common.c
> > +++ b/arch/powerpc/kernel/setup-common.c
> > @@ -854,8 +854,10 @@ static void smp_setup_pacas(void)
> >  		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
> >  	}
> >  
> > -	memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
> > -	cpu_to_phys_id = NULL;
> > +	if (cpu_to_phys_id) {
> > +		memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
> > +		cpu_to_phys_id = NULL;
> > +	}
> >  }
> >  #endif  
>  
> Where did you want that?

Patch 8 should have

	if (cpu_to_phys_id) {
		memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
		cpu_to_phys_id = NULL;
	}

Right after the set_hard_smp_processor_id() loop. Patch 9 then moves it
all into smp_setup_pacas(), plus the allocate_paca() call in the loop.

I think that makes sense.

Thanks for fixing it all up; it had a few rough edges.

* Re: [PATCH 12/14] powerpc: pass node id into create_section_mapping
  2018-03-29  5:51   ` Michael Ellerman
@ 2018-03-29 15:15     ` Nicholas Piggin
  0 siblings, 0 replies; 36+ messages in thread
From: Nicholas Piggin @ 2018-03-29 15:15 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

On Thu, 29 Mar 2018 16:51:16 +1100
Michael Ellerman <mpe@ellerman.id.au> wrote:

> Nicholas Piggin <npiggin@gmail.com> writes:
> 
> > diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> > index 328ff9abc333..435b19e74508 100644
> > --- a/arch/powerpc/mm/pgtable-radix.c
> > +++ b/arch/powerpc/mm/pgtable-radix.c
> > @@ -862,9 +862,9 @@ static void remove_pagetable(unsigned long start, unsigned long end)
> >  	radix__flush_tlb_kernel_range(start, end);
> >  }
> >  
> > -int __ref radix__create_section_mapping(unsigned long start, unsigned long end)
> > +int __ref radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
> >  {
> > -	return create_physical_mapping(start, end);
> > +	return create_physical_mapping(start, end, nid);
> >  }  
> 
> This got a little muddled. We add the nid argument here, but
> create_physical_mapping() doesn't take it until patch 14.
> 
> I managed to fix it by rearranging the last three patches and fiddling
> things a bit. If you can check the result once I push that would be good.

I think it looks okay how you've got it.

Thanks,
Nick

* Re: [01/14] powerpc/64s: do not allocate lppaca if we are not virtualized
  2018-02-13 15:08 ` [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized Nicholas Piggin
@ 2018-03-31 14:03   ` Michael Ellerman
  0 siblings, 0 replies; 36+ messages in thread
From: Michael Ellerman @ 2018-03-31 14:03 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

On Tue, 2018-02-13 at 15:08:11 UTC, Nicholas Piggin wrote:
> The "lppaca" is a structure registered with the hypervisor. This
> is unnecessary when running on non-virtualised platforms. One field
> from the lppaca (pmcregs_in_use) is also used by the host, so move
> the host part out into the paca (lppaca field is still updated in
> guest mode).
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8e0b634b132752ec3eba50afb95250

cheers

Thread overview: 36+ messages
2018-02-13 15:08 [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Nicholas Piggin
2018-02-13 15:08 ` [PATCH 01/14] powerpc/64s: do not allocate lppaca if we are not virtualized Nicholas Piggin
2018-03-31 14:03   ` [01/14] " Michael Ellerman
2018-02-13 15:08 ` [PATCH 02/14] powerpc/64: Use array of paca pointers and allocate pacas individually Nicholas Piggin
2018-02-13 15:08 ` [PATCH 03/14] powerpc/64s: allocate lppacas individually Nicholas Piggin
2018-03-13 12:41   ` Michael Ellerman
2018-03-13 12:54     ` Nicholas Piggin
2018-03-16 14:16       ` Michael Ellerman
2018-02-13 15:08 ` [PATCH 04/14] powerpc/64s: allocate slb_shadow structures individually Nicholas Piggin
2018-02-13 15:08 ` [PATCH 05/14] mm: make memblock_alloc_base_nid non-static Nicholas Piggin
2018-03-13 12:06   ` OK to merge via powerpc? (was Re: [PATCH 05/14] mm: make memblock_alloc_base_nid non-static) Michael Ellerman
2018-03-13 19:41     ` Andrew Morton
2018-03-14  0:56       ` Nicholas Piggin
2018-02-13 15:08 ` [PATCH 06/14] powerpc/mm/numa: move numa topology discovery earlier Nicholas Piggin
2018-02-13 15:08 ` [PATCH 07/14] powerpc/64: move default SPR recording Nicholas Piggin
2018-03-13 12:25   ` Michael Ellerman
2018-03-13 12:55     ` Nicholas Piggin
2018-03-13 15:47     ` Nicholas Piggin
2018-02-13 15:08 ` [PATCH 08/14] powerpc/setup: cpu_to_phys_id array Nicholas Piggin
2018-03-29  5:51   ` Michael Ellerman
2018-02-13 15:08 ` [PATCH 09/14] powerpc/64: defer paca allocation until memory topology is discovered Nicholas Piggin
2018-03-29  5:51   ` Michael Ellerman
2018-02-13 15:08 ` [PATCH 10/14] powerpc/64: allocate pacas per node Nicholas Piggin
2018-03-29  5:50   ` Michael Ellerman
2018-02-13 15:08 ` [PATCH 11/14] powerpc/64: allocate per-cpu stacks node-local if possible Nicholas Piggin
2018-02-13 15:08 ` [PATCH 12/14] powerpc: pass node id into create_section_mapping Nicholas Piggin
2018-03-29  5:51   ` Michael Ellerman
2018-03-29 15:15     ` Nicholas Piggin
2018-02-13 15:08 ` [PATCH 13/14] powerpc/64s/radix: split early page table mapping to its own function Nicholas Piggin
2018-02-13 15:08 ` [PATCH 14/14] powerpc/64s/radix: allocate kernel page tables node-local if possible Nicholas Piggin
2018-03-07 10:50 ` [PATCH 00/14] numa aware allocation for pacas, stacks, pagetables Michael Ellerman
2018-03-07 11:23   ` Nicholas Piggin
2018-03-08  2:04   ` Nicholas Piggin
2018-03-29  6:18     ` Michael Ellerman
2018-03-29 12:04       ` Nicholas Piggin
