LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX
@ 2021-02-11 13:51 Michael Ellerman
  2021-02-11 13:51 ` [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp() Michael Ellerman
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

In the past we had a fallback definition for _PAGE_KERNEL_ROX, but we
removed that in commit d82fd29c5a8c ("powerpc/mm: Distribute platform
specific PAGE and PMD flags and definitions") and added definitions
for each MMU family.

However we missed adding a definition for 64s, which was not really a
bug because it's currently not used.

But we'd like to use PAGE_KERNEL_ROX in a future patch so add a
definition now.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index a39886681629..1358fae01d04 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -116,6 +116,7 @@
  */
 #define _PAGE_KERNEL_RW		(_PAGE_PRIVILEGED | _PAGE_RW | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RO		 (_PAGE_PRIVILEGED | _PAGE_READ)
+#define _PAGE_KERNEL_ROX	 (_PAGE_PRIVILEGED | _PAGE_READ | _PAGE_EXEC)
 #define _PAGE_KERNEL_RWX	(_PAGE_PRIVILEGED | _PAGE_DIRTY |	\
 				 _PAGE_RW | _PAGE_EXEC)
 /*
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
@ 2021-02-11 13:51 ` Michael Ellerman
  2021-02-16  5:39   ` Daniel Axtens
  2021-02-11 13:51 ` [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro() Michael Ellerman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes
the key in bits 9-13, but currently we always set those bits to zero.

In the past that hasn't been a problem because we always used key 0
for the kernel, and updateboltedpp() is only used for kernel mappings.

However since commit d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3
for kernel mapping with hash translation") we are now inadvertently
changing the key (to zero) when we call plpar_pte_protect().

That hasn't broken anything because updateboltedpp() is only used for
STRICT_KERNEL_RWX, which is currently disabled on 64s due to other
bugs.

But we want to fix that, so first we need to pass the key correctly to
plpar_pte_protect(). In the `newpp` value the low 3 bits of the key
are already in the correct spot, but the high 2 bits of the key need
to be shifted down.

Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/platforms/pseries/lpar.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 764170fdb0f7..8bbbddff7226 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -976,11 +976,13 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
 	slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
 	BUG_ON(slot == -1);
 
-	flags = newpp & 7;
+	flags = newpp & (HPTE_R_PP | HPTE_R_N);
 	if (mmu_has_feature(MMU_FTR_KERNEL_RO))
 		/* Move pp0 into bit 8 (IBM 55) */
 		flags |= (newpp & HPTE_R_PP0) >> 55;
 
+	flags |= ((newpp & HPTE_R_KEY_HI) >> 48) | (newpp & HPTE_R_KEY_LO);
+
 	lpar_rc = plpar_pte_protect(flags, slot, 0);
 
 	BUG_ON(lpar_rc != H_SUCCESS);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro()
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
  2021-02-11 13:51 ` [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp() Michael Ellerman
@ 2021-02-11 13:51 ` Michael Ellerman
  2021-02-16  5:50   ` Daniel Axtens
  2021-02-11 13:51 ` [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range() Michael Ellerman
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

In hash__mark_rodata_ro() we pass the raw PP_RXXX value to
hash__change_memory_range(). That has the effect of setting the key to
zero, because PP_RXXX contains no key value.

Fix it by using htab_convert_pte_flags(), which knows how to convert a
pgprot into a pp value, including the key.

Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/book3s64/hash_pgtable.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 567e0c6b3978..03819c259f0a 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -428,12 +428,14 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
 
 void hash__mark_rodata_ro(void)
 {
-	unsigned long start, end;
+	unsigned long start, end, pp;
 
 	start = (unsigned long)_stext;
 	end = (unsigned long)__init_begin;
 
-	WARN_ON(!hash__change_memory_range(start, end, PP_RXXX));
+	pp = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL_ROX), HPTE_USE_KERNEL_KEY);
+
+	WARN_ON(!hash__change_memory_range(start, end, pp));
 }
 
 void hash__mark_initmem_nx(void)
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range()
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
  2021-02-11 13:51 ` [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp() Michael Ellerman
  2021-02-11 13:51 ` [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro() Michael Ellerman
@ 2021-02-11 13:51 ` Michael Ellerman
  2021-02-19  2:08   ` Daniel Axtens
  2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

Pull the loop calling hpte_updateboltedpp() out of
hash__change_memory_range() into a helper function. We need it to be a
separate function for the next patch.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/book3s64/hash_pgtable.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 03819c259f0a..3663d3cdffac 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -400,10 +400,23 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+static void change_memory_range(unsigned long start, unsigned long end,
+				unsigned int step, unsigned long newpp)
+{
+	unsigned long idx;
+
+	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
+		 start, end, newpp, step);
+
+	for (idx = start; idx < end; idx += step)
+		/* Not sure if we can do much with the return value */
+		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
+							mmu_kernel_ssize);
+}
+
 static bool hash__change_memory_range(unsigned long start, unsigned long end,
 				      unsigned long newpp)
 {
-	unsigned long idx;
 	unsigned int step, shift;
 
 	shift = mmu_psize_defs[mmu_linear_psize].shift;
@@ -415,13 +428,7 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
 	if (start >= end)
 		return false;
 
-	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
-		 start, end, newpp, step);
-
-	for (idx = start; idx < end; idx += step)
-		/* Not sure if we can do much with the return value */
-		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
-							mmu_kernel_ssize);
+	change_memory_range(start, end, step, newpp);
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
                   ` (2 preceding siblings ...)
  2021-02-11 13:51 ` [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range() Michael Ellerman
@ 2021-02-11 13:51 ` Michael Ellerman
  2021-02-11 23:16   ` Nicholas Piggin
                     ` (2 more replies)
  2021-02-11 13:51 ` [PATCH 6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again Michael Ellerman
  2021-04-10 14:28 ` [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
  5 siblings, 3 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

When we enabled STRICT_KERNEL_RWX we received some reports of boot
failures when using the Hash MMU and running under phyp. The crashes
are intermittent, and often exhibit as a completely unresponsive
system, or possibly an oops.

One example, which was caught in xmon:

  [   14.068327][    T1] devtmpfs: mounted
  [   14.069302][    T1] Freeing unused kernel memory: 5568K
  [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
  [   14.142063][    T1] Run /sbin/init as init process
  [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
  cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
      pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
      lr: c0000000001862d4: update_rq_clock+0x44/0x110
      sp: c00000000c747880
     msr: 8000000040001031
    current = 0xc00000000c60d380
    paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
      pid   = 347, comm = kworker/2:1
  ...
  enter ? for help
  [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
  [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
  [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
  [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
  [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
  [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
  [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
  [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
  [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
  [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
  [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c

This shows that CPU 2, which was idle, woke up and then appears to
randomly take an instruction fault on a completely valid area of
kernel text.

The cause turns out to be the call to hash__mark_rodata_ro(), late in
boot. Due to the way we layout text and rodata, that function actually
changes the permissions for all of text and rodata to read-only plus
execute.

To do the permission change we use a hypervisor call, H_PROTECT. On
phyp that appears to be implemented by briefly removing the mapping of
the kernel text, before putting it back with the updated permissions.
If any other CPU is executing during that window, it will see spurious
faults on the kernel text and/or data, leading to crashes.

To fix it we use stop machine to collect all other CPUs, and then have
them drop into real mode (MMU off), while we change the mapping. That
way they are unaffected by the mapping temporarily disappearing.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 3663d3cdffac..01de985df2c4 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -8,6 +8,7 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/mm.h>
+#include <linux/stop_machine.h>
 
 #include <asm/sections.h>
 #include <asm/mmu.h>
@@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct change_memory_parms {
+	unsigned long start, end, newpp;
+	unsigned int step, nr_cpus, master_cpu;
+	atomic_t cpu_counter;
+};
+
+// We'd rather this was on the stack but it has to be in the RMO
+static struct change_memory_parms chmem_parms;
+
+// And therefore we need a lock to protect it from concurrent use
+static DEFINE_MUTEX(chmem_lock);
+
 static void change_memory_range(unsigned long start, unsigned long end,
 				unsigned int step, unsigned long newpp)
 {
@@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
 							mmu_kernel_ssize);
 }
 
+static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
+{
+	unsigned long msr, tmp, flags;
+	int *p;
+
+	p = &parms->cpu_counter.counter;
+
+	local_irq_save(flags);
+	__hard_EE_RI_disable();
+
+	asm volatile (
+	// Switch to real mode and leave interrupts off
+	"mfmsr	%[msr]			;"
+	"li	%[tmp], %[MSR_IR_DR]	;"
+	"andc	%[tmp], %[msr], %[tmp]	;"
+	"mtmsrd %[tmp]			;"
+
+	// Tell the master we are in real mode
+	"1:				"
+	"lwarx	%[tmp], 0, %[p]		;"
+	"addic	%[tmp], %[tmp], -1	;"
+	"stwcx.	%[tmp], 0, %[p]		;"
+	"bne-	1b			;"
+
+	// Spin until the counter goes to zero
+	"2:				;"
+	"lwz	%[tmp], 0(%[p])		;"
+	"cmpwi	%[tmp], 0		;"
+	"bne-	2b			;"
+
+	// Switch back to virtual mode
+	"mtmsrd %[msr]			;"
+
+	: // outputs
+	  [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
+	: // inputs
+	  [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
+	: // clobbers
+	  "cc", "xer"
+	);
+
+	local_irq_restore(flags);
+
+	return 0;
+}
+
+static int change_memory_range_fn(void *data)
+{
+	struct change_memory_parms *parms = data;
+
+	if (parms->master_cpu != smp_processor_id())
+		return chmem_secondary_loop(parms);
+
+	// Wait for all but one CPU (this one) to call-in
+	while (atomic_read(&parms->cpu_counter) > 1)
+		barrier();
+
+	change_memory_range(parms->start, parms->end, parms->step, parms->newpp);
+
+	mb();
+
+	// Signal the other CPUs that we're done
+	atomic_dec(&parms->cpu_counter);
+
+	return 0;
+}
+
 static bool hash__change_memory_range(unsigned long start, unsigned long end,
 				      unsigned long newpp)
 {
@@ -428,7 +509,29 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
 	if (start >= end)
 		return false;
 
-	change_memory_range(start, end, step, newpp);
+	if (firmware_has_feature(FW_FEATURE_LPAR)) {
+		mutex_lock(&chmem_lock);
+
+		chmem_parms.start = start;
+		chmem_parms.end = end;
+		chmem_parms.step = step;
+		chmem_parms.newpp = newpp;
+		chmem_parms.master_cpu = smp_processor_id();
+
+		cpus_read_lock();
+
+		atomic_set(&chmem_parms.cpu_counter, num_online_cpus());
+
+		// Ensure state is consistent before we call the other CPUs
+		mb();
+
+		stop_machine_cpuslocked(change_memory_range_fn, &chmem_parms,
+					cpu_online_mask);
+
+		cpus_read_unlock();
+		mutex_unlock(&chmem_lock);
+	} else
+		change_memory_range(start, end, step, newpp);
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
                   ` (3 preceding siblings ...)
  2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
@ 2021-02-11 13:51 ` Michael Ellerman
  2021-04-10 14:28 ` [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
  5 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-02-11 13:51 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar

We have now fixed the known bugs in STRICT_KERNEL_RWX for Book3S
64-bit Hash and Radix MMUs, see preceding commits, so allow the
option to be selected again.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..7d6080afbad6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,7 @@ config PPC
 	select ARCH_HAS_MEMBARRIER_CALLBACKS
 	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
-	select ARCH_HAS_STRICT_KERNEL_RWX	if (PPC32 && !HIBERNATION)
+	select ARCH_HAS_STRICT_KERNEL_RWX	if ((PPC_BOOK3S_64 || PPC32) && !HIBERNATION)
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAS_UACCESS_FLUSHCACHE
 	select ARCH_HAS_COPY_MC			if PPC64
-- 
2.25.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
@ 2021-02-11 23:16   ` Nicholas Piggin
  2021-03-20 13:04     ` Michael Ellerman
  2021-02-12  0:36   ` Nicholas Piggin
  2021-02-19  2:43   ` Daniel Axtens
  2 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-02-11 23:16 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar

Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
> 
> One example, which was caught in xmon:
> 
>   [   14.068327][    T1] devtmpfs: mounted
>   [   14.069302][    T1] Freeing unused kernel memory: 5568K
>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>   [   14.142063][    T1] Run /sbin/init as init process
>   [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
>   cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>       pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>       lr: c0000000001862d4: update_rq_clock+0x44/0x110
>       sp: c00000000c747880
>      msr: 8000000040001031
>     current = 0xc00000000c60d380
>     paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
>       pid   = 347, comm = kworker/2:1
>   ...
>   enter ? for help
>   [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>   [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>   [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>   [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>   [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>   [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>   [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>   [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>   [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>   [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>   [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
> 
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
> 
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we layout text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
> 
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.
> 
> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
>  1 file changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
>  #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/sections.h>
>  #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> +	unsigned long start, end, newpp;
> +	unsigned int step, nr_cpus, master_cpu;
> +	atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
>  static void change_memory_range(unsigned long start, unsigned long end,
>  				unsigned int step, unsigned long newpp)
>  {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>  							mmu_kernel_ssize);
>  }
>  
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> +	unsigned long msr, tmp, flags;
> +	int *p;
> +
> +	p = &parms->cpu_counter.counter;
> +
> +	local_irq_save(flags);
> +	__hard_EE_RI_disable();
> +
> +	asm volatile (
> +	// Switch to real mode and leave interrupts off
> +	"mfmsr	%[msr]			;"
> +	"li	%[tmp], %[MSR_IR_DR]	;"
> +	"andc	%[tmp], %[msr], %[tmp]	;"
> +	"mtmsrd %[tmp]			;"
> +
> +	// Tell the master we are in real mode
> +	"1:				"
> +	"lwarx	%[tmp], 0, %[p]		;"
> +	"addic	%[tmp], %[tmp], -1	;"
> +	"stwcx.	%[tmp], 0, %[p]		;"
> +	"bne-	1b			;"
> +
> +	// Spin until the counter goes to zero
> +	"2:				;"
> +	"lwz	%[tmp], 0(%[p])		;"
> +	"cmpwi	%[tmp], 0		;"
> +	"bne-	2b			;"
> +
> +	// Switch back to virtual mode
> +	"mtmsrd %[msr]			;"
> +
> +	: // outputs
> +	  [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
> +	: // inputs
> +	  [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
> +	: // clobbers
> +	  "cc", "xer"
> +	);
> +
> +	local_irq_restore(flags);

Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
set the HARD_DIS flag. Also we don't want RI disabled here because 
tracing will get called first (which might take SLB or HPTE fault).

But it's also slightly rude to ever enable EE under an irq soft mask,
because you don't know if it had been disabled by the masked interrupt 
handler. It's not strictly a problem AFAIK because the interrupt would
just get masked again, but if we try to maintain a good pattern would
be good. Hmm that means we should add a check for irqs soft masked in
__hard_irq_enable(), I'm not sure if all existing users would follow
this rule.

Might be better to call hard_irq_disable(); after the local_irq_save();
and then clear and reset RI inside that region (could just do it at the
same time as disabling MMU).

You could possibly pass old_msr and new_msr into asm directly and do
mfmsr() in C?

Clearing RI here unfortuantely I don't think will prevent interrupt
handlers (sreset or MCE) from trying to go into virtual mode if they
hit here. It only prevents them trying to return.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
  2021-02-11 23:16   ` Nicholas Piggin
@ 2021-02-12  0:36   ` Nicholas Piggin
  2021-03-16  6:40     ` Michael Ellerman
  2021-02-19  2:43   ` Daniel Axtens
  2 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-02-12  0:36 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar

Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
> 
> One example, which was caught in xmon:
> 
>   [   14.068327][    T1] devtmpfs: mounted
>   [   14.069302][    T1] Freeing unused kernel memory: 5568K
>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>   [   14.142063][    T1] Run /sbin/init as init process
>   [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
>   cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>       pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>       lr: c0000000001862d4: update_rq_clock+0x44/0x110
>       sp: c00000000c747880
>      msr: 8000000040001031
>     current = 0xc00000000c60d380
>     paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
>       pid   = 347, comm = kworker/2:1
>   ...
>   enter ? for help
>   [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>   [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>   [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>   [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>   [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>   [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>   [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>   [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>   [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>   [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>   [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
> 
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
> 
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we layout text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
> 
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.
> 
> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
>  1 file changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
>  #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/sections.h>
>  #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> +	unsigned long start, end, newpp;
> +	unsigned int step, nr_cpus, master_cpu;
> +	atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
>  static void change_memory_range(unsigned long start, unsigned long end,
>  				unsigned int step, unsigned long newpp)
>  {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>  							mmu_kernel_ssize);
>  }
>  
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> +	unsigned long msr, tmp, flags;
> +	int *p;
> +
> +	p = &parms->cpu_counter.counter;
> +
> +	local_irq_save(flags);
> +	__hard_EE_RI_disable();
> +
> +	asm volatile (
> +	// Switch to real mode and leave interrupts off
> +	"mfmsr	%[msr]			;"
> +	"li	%[tmp], %[MSR_IR_DR]	;"
> +	"andc	%[tmp], %[msr], %[tmp]	;"
> +	"mtmsrd %[tmp]			;"
> +
> +	// Tell the master we are in real mode
> +	"1:				"
> +	"lwarx	%[tmp], 0, %[p]		;"
> +	"addic	%[tmp], %[tmp], -1	;"
> +	"stwcx.	%[tmp], 0, %[p]		;"
> +	"bne-	1b			;"
> +
> +	// Spin until the counter goes to zero
> +	"2:				;"
> +	"lwz	%[tmp], 0(%[p])		;"
> +	"cmpwi	%[tmp], 0		;"
> +	"bne-	2b			;"
> +
> +	// Switch back to virtual mode
> +	"mtmsrd %[msr]			;"

Pity we don't have something that can switch to emergency stack and
so we can write this stuff in C.

How's something like this suit you?

---
 arch/powerpc/kernel/misc_64.S | 22 +++++++++++++++++++++
 arch/powerpc/kernel/process.c | 37 +++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index 070465825c21..5e911d0b0b16 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -27,6 +27,28 @@
 
 	.text
 
+#ifdef CONFIG_PPC_BOOK3S_64
+_GLOBAL(__call_realmode)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
+	mr	r1,r5
+	mtctr	r3
+	mr	r3,r4
+	mfmsr	r4
+	xori	r4,r4,(MSR_IR|MSR_DR)
+	mtmsrd	r4
+	bctrl
+	mfmsr	r4
+	xori	r4,r4,(MSR_IR|MSR_DR)
+	mtmsrd	r4
+	ld	r1,0(r1)
+	ld	r0,16(r1)
+	mtlr	r0
+	blr
+
+#endif
+
 _GLOBAL(call_do_softirq)
 	mflr	r0
 	std	r0,16(r1)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a66f435dabbf..260d60f665a3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2197,6 +2197,43 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
 	put_task_stack(tsk);
 }
 
+#ifdef CONFIG_PPC_BOOK3S_64
+int __call_realmode(int (*fn)(void *arg), void *arg, void *sp);
+
+/* XXX: find a better place for this
+ * Executing C code in real-mode in general Book3S-64 code can only be done
+ * via this function that switches the stack to one inside the real-mode-area,
+ * which may cover only a small first part of real memory on hash guest LPARs.
+ * fn must be NOKPROBES, must not access vmalloc or anything outside the RMA,
+ * probably shouldn't enable the MMU or interrupts, etc, and be very careful
+ * about calling other generic kernel or powerpc functions.
+ */
+int call_realmode(int (*fn)(void *arg), void *arg)
+{
+	unsigned long flags;
+	void *cursp, *emsp;
+	int ret;
+
+	/* Stack switch is only really required for HPT LPAR, but do it for all to help test coverage of tricky code */
+	cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+	emsp = (void *)(local_paca->emergency_sp - THREAD_SIZE);
+
+	/* XXX check_stack_overflow(); */
+
+	if (WARN_ON_ONCE(cursp == emsp))
+		return -EBUSY;
+
+	local_irq_save(flags);
+	hard_irq_disable();
+
+	ret = __call_realmode(fn, arg, emsp);
+
+	local_irq_restore(flags);
+
+	return ret;
+}
+#endif
+
 #ifdef CONFIG_PPC64
 /* Called with hard IRQs off */
 void notrace __ppc64_runlatch_on(void)
-- 
2.23.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()
  2021-02-11 13:51 ` [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp() Michael Ellerman
@ 2021-02-16  5:39   ` Daniel Axtens
  2021-02-18 23:25     ` Michael Ellerman
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Axtens @ 2021-02-16  5:39 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

Michael Ellerman <mpe@ellerman.id.au> writes:

> The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes
> the key in bits 9-13, but currently we always set those bits to zero.
>
> In the past that hasn't been a problem because we always used key 0
> for the kernel, and updateboltedpp() is only used for kernel mappings.
>
> However since commit d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3
> for kernel mapping with hash translation") we are now inadvertently
> changing the key (to zero) when we call plpar_pte_protect().
>
> That hasn't broken anything because updateboltedpp() is only used for
> STRICT_KERNEL_RWX, which is currently disabled on 64s due to other
> bugs.
>
> But we want to fix that, so first we need to pass the key correctly to
> plpar_pte_protect(). In the `newpp` value the low 3 bits of the key
> are already in the correct spot, but the high 2 bits of the key need
> to be shifted down.
>
> Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/platforms/pseries/lpar.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> index 764170fdb0f7..8bbbddff7226 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -976,11 +976,13 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
>  	slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
>  	BUG_ON(slot == -1);
>  
> -	flags = newpp & 7;
> +	flags = newpp & (HPTE_R_PP | HPTE_R_N);
>  	if (mmu_has_feature(MMU_FTR_KERNEL_RO))
>  		/* Move pp0 into bit 8 (IBM 55) */
>  		flags |= (newpp & HPTE_R_PP0) >> 55;
>  
> +	flags |= ((newpp & HPTE_R_KEY_HI) >> 48) | (newpp & HPTE_R_KEY_LO);
> +

I'm really confused about how these bits are getting packed into the
flags parameter. It seems to match how they are unpacked in
kvmppc_h_pr_protect, but I cannot figure out why they are packed in that
order, and the LoPAR doesn't seem especially illuminating on this topic
- although I may have missed the relevant section.

Kind regards,
Daniel

>  	lpar_rc = plpar_pte_protect(flags, slot, 0);
>  
>  	BUG_ON(lpar_rc != H_SUCCESS);
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro()
  2021-02-11 13:51 ` [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro() Michael Ellerman
@ 2021-02-16  5:50   ` Daniel Axtens
  0 siblings, 0 replies; 22+ messages in thread
From: Daniel Axtens @ 2021-02-16  5:50 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

Hi Michael,

> In hash__mark_rodata_ro() we pass the raw PP_RXXX value to
> hash__change_memory_range(). That has the effect of setting the key to
> zero, because PP_RXXX contains no key value.
>
> Fix it by using htab_convert_pte_flags(), which knows how to convert a
> pgprot into a pp value, including the key.

So far as I can tell by chasing the definitions around, this appears
to do what it claims to do.

So, for what it's worth:
Reviewed-by: Daniel Axtens <dja@axtens.net>

Kind regards,
Daniel

>
> Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 567e0c6b3978..03819c259f0a 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -428,12 +428,14 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
>  
>  void hash__mark_rodata_ro(void)
>  {
> -	unsigned long start, end;
> +	unsigned long start, end, pp;
>  
>  	start = (unsigned long)_stext;
>  	end = (unsigned long)__init_begin;
>  
> -	WARN_ON(!hash__change_memory_range(start, end, PP_RXXX));
> +	pp = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL_ROX), HPTE_USE_KERNEL_KEY);
> +
> +	WARN_ON(!hash__change_memory_range(start, end, pp));
>  }
>  
>  void hash__mark_initmem_nx(void)
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()
  2021-02-16  5:39   ` Daniel Axtens
@ 2021-02-18 23:25     ` Michael Ellerman
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-02-18 23:25 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: aneesh.kumar

Daniel Axtens <dja@axtens.net> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes
>> the key in bits 9-13, but currently we always set those bits to zero.
>>
>> In the past that hasn't been a problem because we always used key 0
>> for the kernel, and updateboltedpp() is only used for kernel mappings.
>>
>> However since commit d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3
>> for kernel mapping with hash translation") we are now inadvertently
>> changing the key (to zero) when we call plpar_pte_protect().
>>
>> That hasn't broken anything because updateboltedpp() is only used for
>> STRICT_KERNEL_RWX, which is currently disabled on 64s due to other
>> bugs.
>>
>> But we want to fix that, so first we need to pass the key correctly to
>> plpar_pte_protect(). In the `newpp` value the low 3 bits of the key
>> are already in the correct spot, but the high 2 bits of the key need
>> to be shifted down.
>>
>> Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>  arch/powerpc/platforms/pseries/lpar.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
>> index 764170fdb0f7..8bbbddff7226 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -976,11 +976,13 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
>>  	slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
>>  	BUG_ON(slot == -1);
>>  
>> -	flags = newpp & 7;
>> +	flags = newpp & (HPTE_R_PP | HPTE_R_N);
>>  	if (mmu_has_feature(MMU_FTR_KERNEL_RO))
>>  		/* Move pp0 into bit 8 (IBM 55) */
>>  		flags |= (newpp & HPTE_R_PP0) >> 55;
>>  
>> +	flags |= ((newpp & HPTE_R_KEY_HI) >> 48) | (newpp & HPTE_R_KEY_LO);
>> +
>
> I'm really confused about how these bits are getting packed into the
> flags parameter. It seems to match how they are unpacked in
> kvmppc_h_pr_protect, but I cannot figure out why they are packed in that
> order, and the LoPAR doesn't seem especially illuminating on this topic
> - although I may have missed the relevant section.

Yeah I agree it's not very clearly specified.

The hcall we're using here is H_PROTECT, which is specified in section
14.5.4.1.6 of LoPAPR v1.1.

It takes a `flags` parameter, and the description for flags says:

 * flags: AVPN, pp0, pp1, pp2, key0-key4, n, and for the CMO
   option: CMO Option flags as defined in Table 189‚


If you then go to the start of the parent section, 14.5.4.1, on page
405, it says:

Register Linkage (For hcall() tokens 0x04 - 0x18)
 * On Call
   * R3 function call token
   * R4 flags (see Table 178‚ “Page Frame Table Access flags field definition‚” on page 401)


Then you have to go to section 14.5.3, and on page 394 there is a list
of hcalls and their tokens (table 176), and there you can see that
H_PROTECT == 0x18.

Finally you can look at table 178, on page 401, where it specifies the
layout of the bits for the key:

 Bit     Function
------------------
 50-54 | key0-key4


Those are big-endian bit numbers, converting to normal bit numbers you
get bits 9-13, or 0x3e00.

If you look at the kernel source we have:

#define HPTE_R_KEY_HI		ASM_CONST(0x3000000000000000)
#define HPTE_R_KEY_LO		ASM_CONST(0x0000000000000e00)

So the LO bits are already in the right place, and the HI bits just need
to be shifted down by 48.

Hope that makes it clearer :)

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range()
  2021-02-11 13:51 ` [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range() Michael Ellerman
@ 2021-02-19  2:08   ` Daniel Axtens
  2021-03-16  6:30     ` Michael Ellerman
  0 siblings, 1 reply; 22+ messages in thread
From: Daniel Axtens @ 2021-02-19  2:08 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

Michael Ellerman <mpe@ellerman.id.au> writes:

> Pull the loop calling hpte_updateboltedpp() out of
> hash__change_memory_range() into a helper function. We need it to be a
> separate function for the next patch.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 03819c259f0a..3663d3cdffac 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -400,10 +400,23 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +static void change_memory_range(unsigned long start, unsigned long end,
> +				unsigned int step, unsigned long newpp)

Looking at the call paths, this gets called only in bare metal, not
virtualised: should the name reflect that?

> +{
> +	unsigned long idx;
> +
> +	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
> +		 start, end, newpp, step);
> +
> +	for (idx = start; idx < end; idx += step)
> +		/* Not sure if we can do much with the return value */

Hmm, I realise this comment isn't changed, but it did make me wonder
what the return value!

It turns out that the function doesn't actually return anything.

Tracking back the history of hpte_updateboltedpp, it looks like it has
not had a return value since the start of git history:

^1da177e4c3f4 include/asm-ppc64/machdep.h    void            (*hpte_updateboltedpp)(unsigned long newpp, 
3c726f8dee6f5 include/asm-powerpc/machdep.h                                         unsigned long ea,
1189be6508d45 include/asm-powerpc/machdep.h                                        int psize, int ssize);

The comment comes from commit cd65d6971334 ("powerpc/mm/hash: Implement
mark_rodata_ro() for hash") where Balbir added the comment, but again I
can't figure out what sort of return value there would be to ignore.

Should we drop the comment? (or return something from hpte_updateboltedpp)

> +		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
> +							mmu_kernel_ssize);
> +}
> +
>  static bool hash__change_memory_range(unsigned long start, unsigned long end,
>  				      unsigned long newpp)
>  {
> -	unsigned long idx;
>  	unsigned int step, shift;
>  
>  	shift = mmu_psize_defs[mmu_linear_psize].shift;
> @@ -415,13 +428,7 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
>  	if (start >= end)
>  		return false;
>  
> -	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
> -		 start, end, newpp, step);
> -
> -	for (idx = start; idx < end; idx += step)
> -		/* Not sure if we can do much with the return value */
> -		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
> -							mmu_kernel_ssize);
> +	change_memory_range(start, end, step, newpp);

Looking at how change_memory_range is called, step is derived by:

	shift = mmu_psize_defs[mmu_linear_psize].shift;
	step = 1 << shift;

We probably therefore don't really need to pass step in to
change_memory_range. Having said that, I'm not sure it would really be that
much tidier to compute step in change_memory_range, especially since we
also need step for the other branch in hash__change_memory_range.

Beyond that it all looks reasonable to me!

I also checked that the loop operations made sense, I think they do - we
cover from start inclusive to end exclusive and the alignment is done
before we call into change_memory_range.

Regards,
Daniel

>  	return true;
>  }
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
  2021-02-11 23:16   ` Nicholas Piggin
  2021-02-12  0:36   ` Nicholas Piggin
@ 2021-02-19  2:43   ` Daniel Axtens
  2021-03-19 11:56     ` Michael Ellerman
  2 siblings, 1 reply; 22+ messages in thread
From: Daniel Axtens @ 2021-02-19  2:43 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

Michael Ellerman <mpe@ellerman.id.au> writes:

> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
>
> One example, which was caught in xmon:
>
>   [   14.068327][    T1] devtmpfs: mounted
>   [   14.069302][    T1] Freeing unused kernel memory: 5568K
>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>   [   14.142063][    T1] Run /sbin/init as init process
>   [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
>   cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>       pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>       lr: c0000000001862d4: update_rq_clock+0x44/0x110
>       sp: c00000000c747880
>      msr: 8000000040001031
>     current = 0xc00000000c60d380
>     paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
>       pid   = 347, comm = kworker/2:1
>   ...
>   enter ? for help
>   [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>   [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>   [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>   [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>   [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>   [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>   [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>   [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>   [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>   [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>   [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
>
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we layout text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
>
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.

Jordan asked why we saw this on phyp but not under KVM? We had a look at
book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
about!

Nick suggests that the KVM hypervisor is invalidating the HPTE, but
because we run guests in VPM mode, the hypervisor would catch the page
fault and not reflect it down to the guest. It looks like Linux-as-a-HV
will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
hypervisor will catch the fault and not pass it to the guest. But if
phyp runs with VPM mode off, the guest will see the fault before the
hypervisor. (we think this is what's going on anyway.)

We spent a while pondering if phyp is doing something buggy or not...
Looking at the PAPR definition of H_PROTECT, that claims the hypervisor
will do the 'architected “Modifying a Page Table Entry General Case”
sequence'. s 5.10.1.2 of Book IIIS of the ISAv3 defines that, and the
non-atomic hardware sequence does indeed modify the PTE by going through
the invalid state. So it looks like if phyp is running without VPM mode
it's technically not buggy.

Hopefully I'll get to have a look at the rest of the patch shortly!

Kind regards,
Daniel

> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
>  1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
>  #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>  
>  #include <asm/sections.h>
>  #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> +	unsigned long start, end, newpp;
> +	unsigned int step, nr_cpus, master_cpu;
> +	atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
>  static void change_memory_range(unsigned long start, unsigned long end,
>  				unsigned int step, unsigned long newpp)
>  {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>  							mmu_kernel_ssize);
>  }
>  
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> +	unsigned long msr, tmp, flags;
> +	int *p;
> +
> +	p = &parms->cpu_counter.counter;
> +
> +	local_irq_save(flags);
> +	__hard_EE_RI_disable();
> +
> +	asm volatile (
> +	// Switch to real mode and leave interrupts off
> +	"mfmsr	%[msr]			;"
> +	"li	%[tmp], %[MSR_IR_DR]	;"
> +	"andc	%[tmp], %[msr], %[tmp]	;"
> +	"mtmsrd %[tmp]			;"
> +
> +	// Tell the master we are in real mode
> +	"1:				"
> +	"lwarx	%[tmp], 0, %[p]		;"
> +	"addic	%[tmp], %[tmp], -1	;"
> +	"stwcx.	%[tmp], 0, %[p]		;"
> +	"bne-	1b			;"
> +
> +	// Spin until the counter goes to zero
> +	"2:				;"
> +	"lwz	%[tmp], 0(%[p])		;"
> +	"cmpwi	%[tmp], 0		;"
> +	"bne-	2b			;"
> +
> +	// Switch back to virtual mode
> +	"mtmsrd %[msr]			;"
> +
> +	: // outputs
> +	  [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
> +	: // inputs
> +	  [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
> +	: // clobbers
> +	  "cc", "xer"
> +	);
> +
> +	local_irq_restore(flags);
> +
> +	return 0;
> +}
> +
> +static int change_memory_range_fn(void *data)
> +{
> +	struct change_memory_parms *parms = data;
> +
> +	if (parms->master_cpu != smp_processor_id())
> +		return chmem_secondary_loop(parms);
> +
> +	// Wait for all but one CPU (this one) to call-in
> +	while (atomic_read(&parms->cpu_counter) > 1)
> +		barrier();
> +
> +	change_memory_range(parms->start, parms->end, parms->step, parms->newpp);
> +
> +	mb();
> +
> +	// Signal the other CPUs that we're done
> +	atomic_dec(&parms->cpu_counter);
> +
> +	return 0;
> +}
> +
>  static bool hash__change_memory_range(unsigned long start, unsigned long end,
>  				      unsigned long newpp)
>  {
> @@ -428,7 +509,29 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
>  	if (start >= end)
>  		return false;
>  
> -	change_memory_range(start, end, step, newpp);
> +	if (firmware_has_feature(FW_FEATURE_LPAR)) {
> +		mutex_lock(&chmem_lock);
> +
> +		chmem_parms.start = start;
> +		chmem_parms.end = end;
> +		chmem_parms.step = step;
> +		chmem_parms.newpp = newpp;
> +		chmem_parms.master_cpu = smp_processor_id();
> +
> +		cpus_read_lock();
> +
> +		atomic_set(&chmem_parms.cpu_counter, num_online_cpus());
> +
> +		// Ensure state is consistent before we call the other CPUs
> +		mb();
> +
> +		stop_machine_cpuslocked(change_memory_range_fn, &chmem_parms,
> +					cpu_online_mask);
> +
> +		cpus_read_unlock();
> +		mutex_unlock(&chmem_lock);
> +	} else
> +		change_memory_range(start, end, step, newpp);
>  
>  	return true;
>  }
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range()
  2021-02-19  2:08   ` Daniel Axtens
@ 2021-03-16  6:30     ` Michael Ellerman
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-03-16  6:30 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: aneesh.kumar

Daniel Axtens <dja@axtens.net> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> Pull the loop calling hpte_updateboltedpp() out of
>> hash__change_memory_range() into a helper function. We need it to be a
>> separate function for the next patch.
>>
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>  arch/powerpc/mm/book3s64/hash_pgtable.c | 23 +++++++++++++++--------
>>  1 file changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index 03819c259f0a..3663d3cdffac 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -400,10 +400,23 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>>  #ifdef CONFIG_STRICT_KERNEL_RWX
>> +static void change_memory_range(unsigned long start, unsigned long end,
>> +				unsigned int step, unsigned long newpp)
>
> Looking at the call paths, this gets called only in bare metal, not
> virtualised: should the name reflect that?

It's also called on bare metal:

static bool hash__change_memory_range(unsigned long start, unsigned long end,
				      unsigned long newpp)
{
	...
	if (firmware_has_feature(FW_FEATURE_LPAR)) {
	        ...
		stop_machine_cpuslocked(change_memory_range_fn, &chmem_parms,
					cpu_online_mask);
	        ...
	} else
		change_memory_range(start, end, step, newpp);
                ^^^^^^^^^^^^^^^^^^^


>> +{
>> +	unsigned long idx;
>> +
>> +	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
>> +		 start, end, newpp, step);
>> +
>> +	for (idx = start; idx < end; idx += step)
>> +		/* Not sure if we can do much with the return value */
>
> Hmm, I realise this comment isn't changed, but it did make me wonder
> what the return value!
>
> It turns out that the function doesn't actually return anything.
>
> Tracking back the history of hpte_updateboltedpp, it looks like it has
> not had a return value since the start of git history:
>
> ^1da177e4c3f4 include/asm-ppc64/machdep.h    void            (*hpte_updateboltedpp)(unsigned long newpp, 
> 3c726f8dee6f5 include/asm-powerpc/machdep.h                                         unsigned long ea,
> 1189be6508d45 include/asm-powerpc/machdep.h                                        int psize, int ssize);
>
> The comment comes from commit cd65d6971334 ("powerpc/mm/hash: Implement
> mark_rodata_ro() for hash") where Balbir added the comment, but again I
> can't figure out what sort of return value there would be to ignore.

I suspect he just assumed there was a return value, and the comment is
saying we aren't really allowed to fail here, so what could we do?

In general these routines changing the kernel map permissions aren't
allowed to fail, because the callers don't cope with a failure, and at
least in some cases eg. RW -> RX the permission change is not optional.

> Should we drop the comment? (or return something from hpte_updateboltedpp)

I'll leave the comment for now, but we could probably drop it.

It would be good if hpte_updateboltedpp() could fail and return an
error. Currently pSeries_lpar_hpte_updateboltedpp() BUGs if the hcall
fails, because the only error cases are due to bad input on our part.
And similarly native_hpte_updateboltedpp() panics if we give it bad
input.

We may still need to panic() at a higher level, ie. adding execute to a
mapping is not optional. But possibly for some changes, like RW->RO we
could WARN and continue.

And I guess for modules we could eventually plumb the error all the way
out and fail the module load.

>> +		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
>> +							mmu_kernel_ssize);
>> +}
>> +
>>  static bool hash__change_memory_range(unsigned long start, unsigned long end,
>>  				      unsigned long newpp)
>>  {
>> -	unsigned long idx;
>>  	unsigned int step, shift;
>>  
>>  	shift = mmu_psize_defs[mmu_linear_psize].shift;
>> @@ -415,13 +428,7 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
>>  	if (start >= end)
>>  		return false;
>>  
>> -	pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 0x%x\n",
>> -		 start, end, newpp, step);
>> -
>> -	for (idx = start; idx < end; idx += step)
>> -		/* Not sure if we can do much with the return value */
>> -		mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
>> -							mmu_kernel_ssize);
>> +	change_memory_range(start, end, step, newpp);
>
> Looking at how change_memory_range is called, step is derived by:
>
> 	shift = mmu_psize_defs[mmu_linear_psize].shift;
> 	step = 1 << shift;
>
> We probably therefore don't really need to pass step in to
> change_memory_range. Having said that, I'm not sure it would really be that
> much tidier to compute step in change_memory_range, especially since we
> also need step for the other branch in hash__change_memory_range.

Hmm yeah, swings and roundabouts. I think I'll leave it as is, so that
we're only calculating step in one place.

> Beyond that it all looks reasonable to me!
>
> I also checked that the loop operations made sense, I think they do - we
> cover from start inclusive to end exclusive and the alignment is done
> before we call into change_memory_range.

Thanks.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-12  0:36   ` Nicholas Piggin
@ 2021-03-16  6:40     ` Michael Ellerman
  2021-03-22  3:09       ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-03-16  6:40 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
...
>> 
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index 3663d3cdffac..01de985df2c4 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>  							mmu_kernel_ssize);
>>  }
>>  
>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>> +{
>> +	unsigned long msr, tmp, flags;
>> +	int *p;
>> +
>> +	p = &parms->cpu_counter.counter;
>> +
>> +	local_irq_save(flags);
>> +	__hard_EE_RI_disable();
>> +
>> +	asm volatile (
>> +	// Switch to real mode and leave interrupts off
>> +	"mfmsr	%[msr]			;"
>> +	"li	%[tmp], %[MSR_IR_DR]	;"
>> +	"andc	%[tmp], %[msr], %[tmp]	;"
>> +	"mtmsrd %[tmp]			;"
>> +
>> +	// Tell the master we are in real mode
>> +	"1:				"
>> +	"lwarx	%[tmp], 0, %[p]		;"
>> +	"addic	%[tmp], %[tmp], -1	;"
>> +	"stwcx.	%[tmp], 0, %[p]		;"
>> +	"bne-	1b			;"
>> +
>> +	// Spin until the counter goes to zero
>> +	"2:				;"
>> +	"lwz	%[tmp], 0(%[p])		;"
>> +	"cmpwi	%[tmp], 0		;"
>> +	"bne-	2b			;"
>> +
>> +	// Switch back to virtual mode
>> +	"mtmsrd %[msr]			;"
>
> Pity we don't have something that can switch to emergency stack and
> so we can write this stuff in C.
>
> How's something like this suit you?

It looks like it would be really good for writing exploits :)

I think at the very least we would want the asm part to load the SP
from the paca itself, rather than taking it as a parameter.

But I'm not sure writing these type of things in C is a big win, because
you have to be so careful about what you call anyway. It's almost better
in asm because it's so restrictive.

Obviously having said that, my first attempt got the IRQ save/restore
wrong, so maybe we should at least have some macros to help with it.

Did you have another user for this in mind? The only one that I can
think of at the moment is the subcore stuff.

cheers

> diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
> index 070465825c21..5e911d0b0b16 100644
> --- a/arch/powerpc/kernel/misc_64.S
> +++ b/arch/powerpc/kernel/misc_64.S
> @@ -27,6 +27,28 @@
>  
>  	.text
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +_GLOBAL(__call_realmode)
> +	mflr	r0
> +	std	r0,16(r1)
> +	stdu	r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
> +	mr	r1,r5
> +	mtctr	r3
> +	mr	r3,r4
> +	mfmsr	r4
> +	xori	r4,r4,(MSR_IR|MSR_DR)
> +	mtmsrd	r4
> +	bctrl
> +	mfmsr	r4
> +	xori	r4,r4,(MSR_IR|MSR_DR)
> +	mtmsrd	r4
> +	ld	r1,0(r1)
> +	ld	r0,16(r1)
> +	mtlr	r0
> +	blr
> +
> +#endif
> +
>  _GLOBAL(call_do_softirq)
>  	mflr	r0
>  	std	r0,16(r1)
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index a66f435dabbf..260d60f665a3 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -2197,6 +2197,43 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
>  	put_task_stack(tsk);
>  }
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +int __call_realmode(int (*fn)(void *arg), void *arg, void *sp);
> +
> +/* XXX: find a better place for this
> + * Executing C code in real-mode in general Book3S-64 code can only be done
> + * via this function that switches the stack to one inside the real-mode-area,
> + * which may cover only a small first part of real memory on hash guest LPARs.
> + * fn must be NOKPROBES, must not access vmalloc or anything outside the RMA,
> + * probably shouldn't enable the MMU or interrupts, etc, and be very careful
> + * about calling other generic kernel or powerpc functions.
> + */
> +int call_realmode(int (*fn)(void *arg), void *arg)
> +{
> +	unsigned long flags;
> +	void *cursp, *emsp;
> +	int ret;
> +
> +	/* Stack switch is only really required for HPT LPAR, but do it for all to help test coverage of tricky code */
> +	cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
> +	emsp = (void *)(local_paca->emergency_sp - THREAD_SIZE);
> +
> +	/* XXX check_stack_overflow(); */
> +
> +	if (WARN_ON_ONCE(cursp == emsp))
> +		return -EBUSY;
> +
> +	local_irq_save(flags);
> +	hard_irq_disable();
> +
> +	ret = __call_realmode(fn, arg, emsp);
> +
> +	local_irq_restore(flags);
> +
> +	return ret;
> +}
> +#endif
> +
>  #ifdef CONFIG_PPC64
>  /* Called with hard IRQs off */
>  void notrace __ppc64_runlatch_on(void)
> -- 
> 2.23.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-19  2:43   ` Daniel Axtens
@ 2021-03-19 11:56     ` Michael Ellerman
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-03-19 11:56 UTC (permalink / raw)
  To: Daniel Axtens, linuxppc-dev; +Cc: aneesh.kumar

Daniel Axtens <dja@axtens.net> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
>>
>> One example, which was caught in xmon:
>>
>>   [   14.068327][    T1] devtmpfs: mounted
>>   [   14.069302][    T1] Freeing unused kernel memory: 5568K
>>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>>   [   14.142063][    T1] Run /sbin/init as init process
>>   [   14.142074][  T347] Faulting instruction address: 0xc000000000004400
>>   cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>>       pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>>       lr: c0000000001862d4: update_rq_clock+0x44/0x110
>>       sp: c00000000c747880
>>      msr: 8000000040001031
>>     current = 0xc00000000c60d380
>>     paca    = 0xc00000001ec9de80   irqmask: 0x03   irq_happened: 0x01
>>       pid   = 347, comm = kworker/2:1
>>   ...
>>   enter ? for help
>>   [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>>   [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>>   [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>>   [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>>   [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>>   [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>>   [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>>   [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>>   [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>>   [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>>   [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>>
>> This shows that CPU 2, which was idle, woke up and then appears to
>> randomly take an instruction fault on a completely valid area of
>> kernel text.
>>
>> The cause turns out to be the call to hash__mark_rodata_ro(), late in
>> boot. Due to the way we layout text and rodata, that function actually
>> changes the permissions for all of text and rodata to read-only plus
>> execute.
>>
>> To do the permission change we use a hypervisor call, H_PROTECT. On
>> phyp that appears to be implemented by briefly removing the mapping of
>> the kernel text, before putting it back with the updated permissions.
>> If any other CPU is executing during that window, it will see spurious
>> faults on the kernel text and/or data, leading to crashes.
>
> Jordan asked why we saw this on phyp but not under KVM? We had a look at
> book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
> about!
>
> Nick suggests that the KVM hypervisor is invalidating the HPTE, but
> because we run guests in VPM mode, the hypervisor would catch the page
> fault and not reflect it down to the guest. It looks like Linux-as-a-HV
> will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
> hypervisor will catch the fault and not pass it to the guest.

Yep.

> But if phyp runs with VPM mode off, the guest will see the fault
> before the hypervisor. (we think this is what's going on anyway.)

Yeah. I assumed phyp always ran with VPM=1, but apparently it can run
with it off or on, depending on various configuration settings.

So I'm fairly sure what we're hitting here is VPM=0, where the faults go
straight to the guest.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-02-11 23:16   ` Nicholas Piggin
@ 2021-03-20 13:04     ` Michael Ellerman
  2021-03-22  2:56       ` Nicholas Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-03-20 13:04 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
...
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index 3663d3cdffac..01de985df2c4 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>  							mmu_kernel_ssize);
>>  }
>>  
>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>> +{
>> +	unsigned long msr, tmp, flags;
>> +	int *p;
>> +
>> +	p = &parms->cpu_counter.counter;
>> +
>> +	local_irq_save(flags);
>> +	__hard_EE_RI_disable();
>> +
>> +	asm volatile (
>> +	// Switch to real mode and leave interrupts off
>> +	"mfmsr	%[msr]			;"
>> +	"li	%[tmp], %[MSR_IR_DR]	;"
>> +	"andc	%[tmp], %[msr], %[tmp]	;"
>> +	"mtmsrd %[tmp]			;"
>> +
>> +	// Tell the master we are in real mode
>> +	"1:				"
>> +	"lwarx	%[tmp], 0, %[p]		;"
>> +	"addic	%[tmp], %[tmp], -1	;"
>> +	"stwcx.	%[tmp], 0, %[p]		;"
>> +	"bne-	1b			;"
>> +
>> +	// Spin until the counter goes to zero
>> +	"2:				;"
>> +	"lwz	%[tmp], 0(%[p])		;"
>> +	"cmpwi	%[tmp], 0		;"
>> +	"bne-	2b			;"
>> +
>> +	// Switch back to virtual mode
>> +	"mtmsrd %[msr]			;"
>> +
>> +	: // outputs
>> +	  [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
>> +	: // inputs
>> +	  [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
>> +	: // clobbers
>> +	  "cc", "xer"
>> +	);
>> +
>> +	local_irq_restore(flags);
>
> Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
> set the HARD_DIS flag. Also we don't want RI disabled here because 
> tracing will get called first (which might take SLB or HPTE fault).

Thanks for noticing. I originally wrote hard_irq_disable() but then
thought disabling RI also would be good.

> But it's also slightly rude to ever enable EE under an irq soft mask,
> because you don't know if it had been disabled by the masked interrupt 
> handler. It's not strictly a problem AFAIK because the interrupt would
> just get masked again, but if we try to maintain a good pattern would
> be good. Hmm that means we should add a check for irqs soft masked in
> __hard_irq_enable(), I'm not sure if all existing users would follow
> this rule.
>
> Might be better to call hard_irq_disable(); after the local_irq_save();
> and then clear and reset RI inside that region (could just do it at the
> same time as disabling MMU).

Thinking about it more, there's no real reason to disable RI.

We should be able to return from an interrupt in there, it's just that
if we do take one we'll probably die before we get a chance to return
because the mapping of text will be missing.

So disabling RI doesn't really gain us anything I don't think.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-03-20 13:04     ` Michael Ellerman
@ 2021-03-22  2:56       ` Nicholas Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2021-03-22  2:56 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar

Excerpts from Michael Ellerman's message of March 20, 2021 11:04 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> ...
>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> index 3663d3cdffac..01de985df2c4 100644
>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>>  							mmu_kernel_ssize);
>>>  }
>>>  
>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>> +{
>>> +	unsigned long msr, tmp, flags;
>>> +	int *p;
>>> +
>>> +	p = &parms->cpu_counter.counter;
>>> +
>>> +	local_irq_save(flags);
>>> +	__hard_EE_RI_disable();
>>> +
>>> +	asm volatile (
>>> +	// Switch to real mode and leave interrupts off
>>> +	"mfmsr	%[msr]			;"
>>> +	"li	%[tmp], %[MSR_IR_DR]	;"
>>> +	"andc	%[tmp], %[msr], %[tmp]	;"
>>> +	"mtmsrd %[tmp]			;"
>>> +
>>> +	// Tell the master we are in real mode
>>> +	"1:				"
>>> +	"lwarx	%[tmp], 0, %[p]		;"
>>> +	"addic	%[tmp], %[tmp], -1	;"
>>> +	"stwcx.	%[tmp], 0, %[p]		;"
>>> +	"bne-	1b			;"
>>> +
>>> +	// Spin until the counter goes to zero
>>> +	"2:				;"
>>> +	"lwz	%[tmp], 0(%[p])		;"
>>> +	"cmpwi	%[tmp], 0		;"
>>> +	"bne-	2b			;"
>>> +
>>> +	// Switch back to virtual mode
>>> +	"mtmsrd %[msr]			;"
>>> +
>>> +	: // outputs
>>> +	  [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
>>> +	: // inputs
>>> +	  [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
>>> +	: // clobbers
>>> +	  "cc", "xer"
>>> +	);
>>> +
>>> +	local_irq_restore(flags);
>>
>> Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
>> set the HARD_DIS flag. Also we don't want RI disabled here because 
>> tracing will get called first (which might take SLB or HPTE fault).
> 
> Thanks for noticing. I originally wrote hard_irq_disable() but then
> thought disabling RI also would be good.
> 
>> But it's also slightly rude to ever enable EE under an irq soft mask,
>> because you don't know if it had been disabled by the masked interrupt 
>> handler. It's not strictly a problem AFAIK because the interrupt would
>> just get masked again, but if we try to maintain a good pattern would
>> be good. Hmm that means we should add a check for irqs soft masked in
>> __hard_irq_enable(), I'm not sure if all existing users would follow
>> this rule.
>>
>> Might be better to call hard_irq_disable(); after the local_irq_save();
>> and then clear and reset RI inside that region (could just do it at the
>> same time as disabling MMU).
> 
> Thinking about it more, there's no real reason to disable RI.
> 
> We should be able to return from an interrupt in there, it's just that
> if we do take one we'll probably die before we get a chance to return
> because the mapping of text will be missing.

Yeah it probably will because the pseries hash machine check handler has 
some hacks in it that require turning the MMU on. We might never fix 
that if we're moving to radix, but if we did then in theory we'd be able 
to take a MCE here and recover.

> So disabling RI doesn't really gain us anything I don't think.

Yeah I probably agree. So local_irq_save(flags); hard_irq_disable(); 
should do the trick.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-03-16  6:40     ` Michael Ellerman
@ 2021-03-22  3:09       ` Nicholas Piggin
  2021-03-22  9:07         ` Michael Ellerman
  0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-03-22  3:09 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar

Excerpts from Michael Ellerman's message of March 16, 2021 4:40 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>>> failures when using the Hash MMU and running under phyp. The crashes
>>> are intermittent, and often exhibit as a completely unresponsive
>>> system, or possibly an oops.
> ...
>>> 
>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> index 3663d3cdffac..01de985df2c4 100644
>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>>  							mmu_kernel_ssize);
>>>  }
>>>  
>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>> +{
>>> +	unsigned long msr, tmp, flags;
>>> +	int *p;
>>> +
>>> +	p = &parms->cpu_counter.counter;
>>> +
>>> +	local_irq_save(flags);
>>> +	__hard_EE_RI_disable();
>>> +
>>> +	asm volatile (
>>> +	// Switch to real mode and leave interrupts off
>>> +	"mfmsr	%[msr]			;"
>>> +	"li	%[tmp], %[MSR_IR_DR]	;"
>>> +	"andc	%[tmp], %[msr], %[tmp]	;"
>>> +	"mtmsrd %[tmp]			;"
>>> +
>>> +	// Tell the master we are in real mode
>>> +	"1:				"
>>> +	"lwarx	%[tmp], 0, %[p]		;"
>>> +	"addic	%[tmp], %[tmp], -1	;"
>>> +	"stwcx.	%[tmp], 0, %[p]		;"
>>> +	"bne-	1b			;"
>>> +
>>> +	// Spin until the counter goes to zero
>>> +	"2:				;"
>>> +	"lwz	%[tmp], 0(%[p])		;"
>>> +	"cmpwi	%[tmp], 0		;"
>>> +	"bne-	2b			;"
>>> +
>>> +	// Switch back to virtual mode
>>> +	"mtmsrd %[msr]			;"
>>
>> Pity we don't have something that can switch to emergency stack and
>> so we can write this stuff in C.
>>
>> How's something like this suit you?
> 
> It looks like it would be really good for writing exploits :)

Hmm. In that case maybe the callee function could be inlined into it 
like the interrupt wrappers, and the asm real-mode entry/exit gets
added around it rather than have this little exploit stub. So similar to 
yours but with a stack switch as well so you can come back up in real 
mode.

> I think at the very least we would want the asm part to load the SP
> from the paca itself, rather than taking it as a parameter.
> 
> But I'm not sure writing these type of things in C is a big win, because
> you have to be so careful about what you call anyway. It's almost better
> in asm because it's so restrictive.
> 
> Obviously having said that, my first attempt got the IRQ save/restore
> wrong, so maybe we should at least have some macros to help with it.
> 
> Did you have another user for this in mind? The only one that I can
> think of at the moment is the subcore stuff.

Possibly rtas entry/exit (although that has other issues). But I guess 
it's not a huge amount of asm compared with what I'm dealing with.

I'm okay if you just put your thing in at the moment, we might or might 
not get keen and c-ify it later.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
  2021-03-22  3:09       ` Nicholas Piggin
@ 2021-03-22  9:07         ` Michael Ellerman
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-03-22  9:07 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of March 16, 2021 4:40 pm:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>>>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>>>> failures when using the Hash MMU and running under phyp. The crashes
>>>> are intermittent, and often exhibit as a completely unresponsive
>>>> system, or possibly an oops.
>> ...
>>>> 
>>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> index 3663d3cdffac..01de985df2c4 100644
>>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>>>  							mmu_kernel_ssize);
>>>>  }
>>>>  
>>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>>> +{
>>>> +	unsigned long msr, tmp, flags;
>>>> +	int *p;
>>>> +
>>>> +	p = &parms->cpu_counter.counter;
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	__hard_EE_RI_disable();
>>>> +
>>>> +	asm volatile (
>>>> +	// Switch to real mode and leave interrupts off
>>>> +	"mfmsr	%[msr]			;"
>>>> +	"li	%[tmp], %[MSR_IR_DR]	;"
>>>> +	"andc	%[tmp], %[msr], %[tmp]	;"
>>>> +	"mtmsrd %[tmp]			;"
>>>> +
>>>> +	// Tell the master we are in real mode
>>>> +	"1:				"
>>>> +	"lwarx	%[tmp], 0, %[p]		;"
>>>> +	"addic	%[tmp], %[tmp], -1	;"
>>>> +	"stwcx.	%[tmp], 0, %[p]		;"
>>>> +	"bne-	1b			;"
>>>> +
>>>> +	// Spin until the counter goes to zero
>>>> +	"2:				;"
>>>> +	"lwz	%[tmp], 0(%[p])		;"
>>>> +	"cmpwi	%[tmp], 0		;"
>>>> +	"bne-	2b			;"
>>>> +
>>>> +	// Switch back to virtual mode
>>>> +	"mtmsrd %[msr]			;"
>>>
>>> Pity we don't have something that can switch to emergency stack and
>>> so we can write this stuff in C.
>>>
>>> How's something like this suit you?
>> 
>> It looks like it would be really good for writing exploits :)
>
> Hmm. In that case maybe the callee function could be inlined into it 
> like the interrupt wrappers, and the asm real-mode entry/exit gets
> added around it rather than have this little exploit stub. So similar to 
> yours but with a stack switch as well so you can come back up in real 
> mode.

Yeah inlining as much as possible would reduce the risk.

>> I think at the very least we would want the asm part to load the SP
>> from the paca itself, rather than taking it as a parameter.
>> 
>> But I'm not sure writing these type of things in C is a big win, because
>> you have to be so careful about what you call anyway. It's almost better
>> in asm because it's so restrictive.
>> 
>> Obviously having said that, my first attempt got the IRQ save/restore
>> wrong, so maybe we should at least have some macros to help with it.
>> 
>> Did you have another user for this in mind? The only one that I can
>> think of at the moment is the subcore stuff.
>
> Possibly rtas entry/exit (although that has other issues). But I guess 
> it's not a huge amount of asm compared with what I'm dealing with.

Ah yep, I hadn't thought of RTAS.

> I'm okay if you just put your thing in at the moment, we might or might 
> not get keen and c-ify it later.

OK.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX
  2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
                   ` (4 preceding siblings ...)
  2021-02-11 13:51 ` [PATCH 6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again Michael Ellerman
@ 2021-04-10 14:28 ` Michael Ellerman
  2021-04-19  5:17   ` Michael Ellerman
  5 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-04-10 14:28 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

On Fri, 12 Feb 2021 00:51:25 +1100, Michael Ellerman wrote:
> In the past we had a fallback definition for _PAGE_KERNEL_ROX, but we
> removed that in commit d82fd29c5a8c ("powerpc/mm: Distribute platform
> specific PAGE and PMD flags and definitions") and added definitions
> for each MMU family.
> 
> However we missed adding a definition for 64s, which was not really a
> bug because it's currently not used.
> 
> [...]

Applied to powerpc/next.

[1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX
      https://git.kernel.org/powerpc/c/56bec2f9d4d05675cada96772a8a93010f4d82bf
[2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()
      https://git.kernel.org/powerpc/c/b56d55a5aa4aa9fc166595a7feb57f153ef7b555
[3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro()
      https://git.kernel.org/powerpc/c/2c02e656a29d5f64193eb93da92781bcf0517146
[4/6] powerpc/mm/64s/hash: Factor out change_memory_range()
      https://git.kernel.org/powerpc/c/6f223ebe9c3f3ed315a06cec156086f1f7f7ded1
[5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
      https://git.kernel.org/powerpc/c/87e65ad7bd3a84a992723753fcc23d31c2d063c2
[6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again
      https://git.kernel.org/powerpc/c/bd573a81312fd9d6520b1cc81a88fd29e670e1ff

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX
  2021-04-10 14:28 ` [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
@ 2021-04-19  5:17   ` Michael Ellerman
  0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-04-19  5:17 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar

Michael Ellerman <patch-notifications@ellerman.id.au> writes:

> On Fri, 12 Feb 2021 00:51:25 +1100, Michael Ellerman wrote:
>> In the past we had a fallback definition for _PAGE_KERNEL_ROX, but we
>> removed that in commit d82fd29c5a8c ("powerpc/mm: Distribute platform
>> specific PAGE and PMD flags and definitions") and added definitions
>> for each MMU family.
>> 
>> However we missed adding a definition for 64s, which was not really a
>> bug because it's currently not used.
>> 
>> [...]
>
> Applied to powerpc/next.
>
> [1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX
>       https://git.kernel.org/powerpc/c/56bec2f9d4d05675cada96772a8a93010f4d82bf
> [2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()
>       https://git.kernel.org/powerpc/c/b56d55a5aa4aa9fc166595a7feb57f153ef7b555
> [3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro()
>       https://git.kernel.org/powerpc/c/2c02e656a29d5f64193eb93da92781bcf0517146
> [4/6] powerpc/mm/64s/hash: Factor out change_memory_range()
>       https://git.kernel.org/powerpc/c/6f223ebe9c3f3ed315a06cec156086f1f7f7ded1
> [5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
>       https://git.kernel.org/powerpc/c/87e65ad7bd3a84a992723753fcc23d31c2d063c2
> [6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again
>       https://git.kernel.org/powerpc/c/bd573a81312fd9d6520b1cc81a88fd29e670e1ff

v2 was applied.

cheers

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, back to index

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-11 13:51 [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
2021-02-11 13:51 ` [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp() Michael Ellerman
2021-02-16  5:39   ` Daniel Axtens
2021-02-18 23:25     ` Michael Ellerman
2021-02-11 13:51 ` [PATCH 3/6] powerpc/64s: Use htab_convert_pte_flags() in hash__mark_rodata_ro() Michael Ellerman
2021-02-16  5:50   ` Daniel Axtens
2021-02-11 13:51 ` [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range() Michael Ellerman
2021-02-19  2:08   ` Daniel Axtens
2021-03-16  6:30     ` Michael Ellerman
2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
2021-02-11 23:16   ` Nicholas Piggin
2021-03-20 13:04     ` Michael Ellerman
2021-03-22  2:56       ` Nicholas Piggin
2021-02-12  0:36   ` Nicholas Piggin
2021-03-16  6:40     ` Michael Ellerman
2021-03-22  3:09       ` Nicholas Piggin
2021-03-22  9:07         ` Michael Ellerman
2021-02-19  2:43   ` Daniel Axtens
2021-03-19 11:56     ` Michael Ellerman
2021-02-11 13:51 ` [PATCH 6/6] powerpc/mm/64s: Allow STRICT_KERNEL_RWX again Michael Ellerman
2021-04-10 14:28 ` [PATCH 1/6] powerpc/mm/64s: Add _PAGE_KERNEL_ROX Michael Ellerman
2021-04-19  5:17   ` Michael Ellerman

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git