* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
@ 2021-02-11 23:16 ` Nicholas Piggin
2021-03-20 13:04 ` Michael Ellerman
2021-02-12 0:36 ` Nicholas Piggin
2021-02-19 2:43 ` Daniel Axtens
2 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-02-11 23:16 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar
Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
>
> One example, which was caught in xmon:
>
> [ 14.068327][ T1] devtmpfs: mounted
> [ 14.069302][ T1] Freeing unused kernel memory: 5568K
> [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
> [ 14.142063][ T1] Run /sbin/init as init process
> [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
> cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
> pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
> lr: c0000000001862d4: update_rq_clock+0x44/0x110
> sp: c00000000c747880
> msr: 8000000040001031
> current = 0xc00000000c60d380
> paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
> pid = 347, comm = kworker/2:1
> ...
> enter ? for help
> [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
> [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
> [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
> [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
> [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
> [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
> [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
> [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
> [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
> [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
> [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
>
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we lay out text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
>
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.
>
> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
> arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>
> #include <asm/sections.h>
> #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> + unsigned long start, end, newpp;
> + unsigned int step, nr_cpus, master_cpu;
> + atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
> static void change_memory_range(unsigned long start, unsigned long end,
> unsigned int step, unsigned long newpp)
> {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
> mmu_kernel_ssize);
> }
>
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> + unsigned long msr, tmp, flags;
> + int *p;
> +
> + p = &parms->cpu_counter.counter;
> +
> + local_irq_save(flags);
> + __hard_EE_RI_disable();
> +
> + asm volatile (
> + // Switch to real mode and leave interrupts off
> + "mfmsr %[msr] ;"
> + "li %[tmp], %[MSR_IR_DR] ;"
> + "andc %[tmp], %[msr], %[tmp] ;"
> + "mtmsrd %[tmp] ;"
> +
> + // Tell the master we are in real mode
> + "1: "
> + "lwarx %[tmp], 0, %[p] ;"
> + "addic %[tmp], %[tmp], -1 ;"
> + "stwcx. %[tmp], 0, %[p] ;"
> + "bne- 1b ;"
> +
> + // Spin until the counter goes to zero
> + "2: ;"
> + "lwz %[tmp], 0(%[p]) ;"
> + "cmpwi %[tmp], 0 ;"
> + "bne- 2b ;"
> +
> + // Switch back to virtual mode
> + "mtmsrd %[msr] ;"
> +
> + : // outputs
> + [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
> + : // inputs
> + [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
> + : // clobbers
> + "cc", "xer"
> + );
> +
> + local_irq_restore(flags);
Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
set the HARD_DIS flag. Also we don't want RI disabled here, because
tracing will get called first (which might take an SLB or HPTE fault).
But it's also slightly rude to ever enable EE under an irq soft mask,
because you don't know if it had been disabled by the masked interrupt
handler. It's not strictly a problem AFAIK because the interrupt would
just get masked again, but if we're trying to maintain a good pattern it
would be good to follow it here. Hmm, that means we should add a check
for irqs soft masked in __hard_irq_enable(); I'm not sure all existing
users would follow this rule.
Might be better to call hard_irq_disable(); after the local_irq_save();
and then clear and reset RI inside that region (could just do it at the
same time as disabling MMU).
You could possibly pass old_msr and new_msr into asm directly and do
mfmsr() in C?
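i.e. something like this, maybe (completely untested, just sketching
the shape I mean):

	unsigned long old_msr, new_msr, tmp, flags;
	int *p = &parms->cpu_counter.counter;

	local_irq_save(flags);
	hard_irq_disable();	// sets HARD_DIS so the restore is right

	old_msr = mfmsr();
	new_msr = old_msr & ~(unsigned long)(MSR_IR | MSR_DR | MSR_RI);

	asm volatile (
		// Drop to real mode and clear RI with the one mtmsrd
		"mtmsrd %[new]			;"

		// Tell the master we are in real mode
		"1:				"
		"lwarx %[tmp], 0, %[p]		;"
		"addic %[tmp], %[tmp], -1	;"
		"stwcx. %[tmp], 0, %[p]		;"
		"bne- 1b			;"

		// Spin until the counter goes to zero
		"2:				;"
		"lwz %[tmp], 0(%[p])		;"
		"cmpwi %[tmp], 0		;"
		"bne- 2b			;"

		// Back to virtual mode with RI set again
		"mtmsrd %[old]			;"

		: [tmp] "=&b" (tmp), "+m" (*p)
		: [p] "b" (p), [old] "r" (old_msr), [new] "r" (new_msr)
		: "cc", "xer"
	);

	local_irq_restore(flags);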
Unfortunately I don't think clearing RI here will prevent interrupt
handlers (sreset or MCE) from trying to go into virtual mode if they
hit here; it only prevents them from trying to return.
Thanks,
Nick
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-11 23:16 ` Nicholas Piggin
@ 2021-03-20 13:04 ` Michael Ellerman
2021-03-22 2:56 ` Nicholas Piggin
0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-03-20 13:04 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
...
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index 3663d3cdffac..01de985df2c4 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>> mmu_kernel_ssize);
>> }
>>
>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>> +{
>> + unsigned long msr, tmp, flags;
>> + int *p;
>> +
>> + p = &parms->cpu_counter.counter;
>> +
>> + local_irq_save(flags);
>> + __hard_EE_RI_disable();
>> +
>> + asm volatile (
>> + // Switch to real mode and leave interrupts off
>> + "mfmsr %[msr] ;"
>> + "li %[tmp], %[MSR_IR_DR] ;"
>> + "andc %[tmp], %[msr], %[tmp] ;"
>> + "mtmsrd %[tmp] ;"
>> +
>> + // Tell the master we are in real mode
>> + "1: "
>> + "lwarx %[tmp], 0, %[p] ;"
>> + "addic %[tmp], %[tmp], -1 ;"
>> + "stwcx. %[tmp], 0, %[p] ;"
>> + "bne- 1b ;"
>> +
>> + // Spin until the counter goes to zero
>> + "2: ;"
>> + "lwz %[tmp], 0(%[p]) ;"
>> + "cmpwi %[tmp], 0 ;"
>> + "bne- 2b ;"
>> +
>> + // Switch back to virtual mode
>> + "mtmsrd %[msr] ;"
>> +
>> + : // outputs
>> + [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
>> + : // inputs
>> + [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
>> + : // clobbers
>> + "cc", "xer"
>> + );
>> +
>> + local_irq_restore(flags);
>
> Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
> set the HARD_DIS flag. Also we don't want RI disabled here, because
> tracing will get called first (which might take an SLB or HPTE fault).
Thanks for noticing. I originally wrote hard_irq_disable() but then
thought disabling RI also would be good.
> But it's also slightly rude to ever enable EE under an irq soft mask,
> because you don't know if it had been disabled by the masked interrupt
> handler. It's not strictly a problem AFAIK because the interrupt would
> just get masked again, but if we're trying to maintain a good pattern it
> would be good to follow it here. Hmm, that means we should add a check
> for irqs soft masked in __hard_irq_enable(); I'm not sure all existing
> users would follow this rule.
>
> Might be better to call hard_irq_disable(); after the local_irq_save();
> and then clear and reset RI inside that region (could just do it at the
> same time as disabling MMU).
Thinking about it more, there's no real reason to disable RI.
We should be able to return from an interrupt in there, it's just that
if we do take one we'll probably die before we get a chance to return
because the mapping of text will be missing.
So disabling RI doesn't really gain us anything I don't think.
cheers
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-03-20 13:04 ` Michael Ellerman
@ 2021-03-22 2:56 ` Nicholas Piggin
0 siblings, 0 replies; 22+ messages in thread
From: Nicholas Piggin @ 2021-03-22 2:56 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar
Excerpts from Michael Ellerman's message of March 20, 2021 11:04 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> ...
>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> index 3663d3cdffac..01de985df2c4 100644
>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>> mmu_kernel_ssize);
>>> }
>>>
>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>> +{
>>> + unsigned long msr, tmp, flags;
>>> + int *p;
>>> +
>>> + p = &parms->cpu_counter.counter;
>>> +
>>> + local_irq_save(flags);
>>> + __hard_EE_RI_disable();
>>> +
>>> + asm volatile (
>>> + // Switch to real mode and leave interrupts off
>>> + "mfmsr %[msr] ;"
>>> + "li %[tmp], %[MSR_IR_DR] ;"
>>> + "andc %[tmp], %[msr], %[tmp] ;"
>>> + "mtmsrd %[tmp] ;"
>>> +
>>> + // Tell the master we are in real mode
>>> + "1: "
>>> + "lwarx %[tmp], 0, %[p] ;"
>>> + "addic %[tmp], %[tmp], -1 ;"
>>> + "stwcx. %[tmp], 0, %[p] ;"
>>> + "bne- 1b ;"
>>> +
>>> + // Spin until the counter goes to zero
>>> + "2: ;"
>>> + "lwz %[tmp], 0(%[p]) ;"
>>> + "cmpwi %[tmp], 0 ;"
>>> + "bne- 2b ;"
>>> +
>>> + // Switch back to virtual mode
>>> + "mtmsrd %[msr] ;"
>>> +
>>> + : // outputs
>>> + [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
>>> + : // inputs
>>> + [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
>>> + : // clobbers
>>> + "cc", "xer"
>>> + );
>>> +
>>> + local_irq_restore(flags);
>>
>> Hmm. __hard_EE_RI_disable won't get restored by this because it doesn't
>> set the HARD_DIS flag. Also we don't want RI disabled here, because
>> tracing will get called first (which might take an SLB or HPTE fault).
>
> Thanks for noticing. I originally wrote hard_irq_disable() but then
> thought disabling RI also would be good.
>
>> But it's also slightly rude to ever enable EE under an irq soft mask,
>> because you don't know if it had been disabled by the masked interrupt
>> handler. It's not strictly a problem AFAIK because the interrupt would
>> just get masked again, but if we're trying to maintain a good pattern it
>> would be good to follow it here. Hmm, that means we should add a check
>> for irqs soft masked in __hard_irq_enable(); I'm not sure all existing
>> users would follow this rule.
>>
>> Might be better to call hard_irq_disable(); after the local_irq_save();
>> and then clear and reset RI inside that region (could just do it at the
>> same time as disabling MMU).
>
> Thinking about it more, there's no real reason to disable RI.
>
> We should be able to return from an interrupt in there, it's just that
> if we do take one we'll probably die before we get a chance to return
> because the mapping of text will be missing.
Yeah it probably will because the pseries hash machine check handler has
some hacks in it that require turning the MMU on. We might never fix
that if we're moving to radix, but if we did then in theory we'd be able
to take an MCE here and recover.
> So disabling RI doesn't really gain us anything I don't think.
Yeah I probably agree. So local_irq_save(flags); hard_irq_disable();
should do the trick.
Thanks,
Nick
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
2021-02-11 23:16 ` Nicholas Piggin
@ 2021-02-12 0:36 ` Nicholas Piggin
2021-03-16 6:40 ` Michael Ellerman
2021-02-19 2:43 ` Daniel Axtens
2 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-02-12 0:36 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar
Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
>
> One example, which was caught in xmon:
>
> [ 14.068327][ T1] devtmpfs: mounted
> [ 14.069302][ T1] Freeing unused kernel memory: 5568K
> [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
> [ 14.142063][ T1] Run /sbin/init as init process
> [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
> cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
> pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
> lr: c0000000001862d4: update_rq_clock+0x44/0x110
> sp: c00000000c747880
> msr: 8000000040001031
> current = 0xc00000000c60d380
> paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
> pid = 347, comm = kworker/2:1
> ...
> enter ? for help
> [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
> [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
> [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
> [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
> [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
> [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
> [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
> [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
> [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
> [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
> [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
>
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we lay out text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
>
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.
>
> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
> arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>
> #include <asm/sections.h>
> #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> + unsigned long start, end, newpp;
> + unsigned int step, nr_cpus, master_cpu;
> + atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
> static void change_memory_range(unsigned long start, unsigned long end,
> unsigned int step, unsigned long newpp)
> {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
> mmu_kernel_ssize);
> }
>
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> + unsigned long msr, tmp, flags;
> + int *p;
> +
> + p = &parms->cpu_counter.counter;
> +
> + local_irq_save(flags);
> + __hard_EE_RI_disable();
> +
> + asm volatile (
> + // Switch to real mode and leave interrupts off
> + "mfmsr %[msr] ;"
> + "li %[tmp], %[MSR_IR_DR] ;"
> + "andc %[tmp], %[msr], %[tmp] ;"
> + "mtmsrd %[tmp] ;"
> +
> + // Tell the master we are in real mode
> + "1: "
> + "lwarx %[tmp], 0, %[p] ;"
> + "addic %[tmp], %[tmp], -1 ;"
> + "stwcx. %[tmp], 0, %[p] ;"
> + "bne- 1b ;"
> +
> + // Spin until the counter goes to zero
> + "2: ;"
> + "lwz %[tmp], 0(%[p]) ;"
> + "cmpwi %[tmp], 0 ;"
> + "bne- 2b ;"
> +
> + // Switch back to virtual mode
> + "mtmsrd %[msr] ;"
Pity we don't have something that can switch to the emergency stack so
we can write this stuff in C.
How's something like this suit you?
---
arch/powerpc/kernel/misc_64.S | 22 +++++++++++++++++++++
arch/powerpc/kernel/process.c | 37 +++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
index 070465825c21..5e911d0b0b16 100644
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@ -27,6 +27,28 @@
.text
+#ifdef CONFIG_PPC_BOOK3S_64
+_GLOBAL(__call_realmode)
+ mflr r0
+ std r0,16(r1)
+ stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
+ mr r1,r5
+ mtctr r3
+ mr r3,r4
+ mfmsr r4
+ xori r4,r4,(MSR_IR|MSR_DR)
+ mtmsrd r4
+ bctrl
+ mfmsr r4
+ xori r4,r4,(MSR_IR|MSR_DR)
+ mtmsrd r4
+ ld r1,0(r1)
+ ld r0,16(r1)
+ mtlr r0
+ blr
+
+#endif
+
_GLOBAL(call_do_softirq)
mflr r0
std r0,16(r1)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index a66f435dabbf..260d60f665a3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2197,6 +2197,43 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
put_task_stack(tsk);
}
+#ifdef CONFIG_PPC_BOOK3S_64
+int __call_realmode(int (*fn)(void *arg), void *arg, void *sp);
+
+/* XXX: find a better place for this
+ * Executing C code in real-mode in general Book3S-64 code can only be done
+ * via this function that switches the stack to one inside the real-mode-area,
+ * which may cover only a small first part of real memory on hash guest LPARs.
+ * fn must be NOKPROBES, must not access vmalloc or anything outside the RMA,
+ * probably shouldn't enable the MMU or interrupts, etc, and be very careful
+ * about calling other generic kernel or powerpc functions.
+ */
+int call_realmode(int (*fn)(void *arg), void *arg)
+{
+ unsigned long flags;
+ void *cursp, *emsp;
+ int ret;
+
+ /* Stack switch is only really required for HPT LPAR, but do it for all to help test coverage of tricky code */
+ cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+ emsp = (void *)(local_paca->emergency_sp - THREAD_SIZE);
+
+ /* XXX check_stack_overflow(); */
+
+ if (WARN_ON_ONCE(cursp == emsp))
+ return -EBUSY;
+
+ local_irq_save(flags);
+ hard_irq_disable();
+
+ ret = __call_realmode(fn, arg, emsp);
+
+ local_irq_restore(flags);
+
+ return ret;
+}
+#endif
+
#ifdef CONFIG_PPC64
/* Called with hard IRQs off */
void notrace __ppc64_runlatch_on(void)
--
2.23.0
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-12 0:36 ` Nicholas Piggin
@ 2021-03-16 6:40 ` Michael Ellerman
2021-03-22 3:09 ` Nicholas Piggin
0 siblings, 1 reply; 22+ messages in thread
From: Michael Ellerman @ 2021-03-16 6:40 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
...
>>
>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> index 3663d3cdffac..01de985df2c4 100644
>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>> mmu_kernel_ssize);
>> }
>>
>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>> +{
>> + unsigned long msr, tmp, flags;
>> + int *p;
>> +
>> + p = &parms->cpu_counter.counter;
>> +
>> + local_irq_save(flags);
>> + __hard_EE_RI_disable();
>> +
>> + asm volatile (
>> + // Switch to real mode and leave interrupts off
>> + "mfmsr %[msr] ;"
>> + "li %[tmp], %[MSR_IR_DR] ;"
>> + "andc %[tmp], %[msr], %[tmp] ;"
>> + "mtmsrd %[tmp] ;"
>> +
>> + // Tell the master we are in real mode
>> + "1: "
>> + "lwarx %[tmp], 0, %[p] ;"
>> + "addic %[tmp], %[tmp], -1 ;"
>> + "stwcx. %[tmp], 0, %[p] ;"
>> + "bne- 1b ;"
>> +
>> + // Spin until the counter goes to zero
>> + "2: ;"
>> + "lwz %[tmp], 0(%[p]) ;"
>> + "cmpwi %[tmp], 0 ;"
>> + "bne- 2b ;"
>> +
>> + // Switch back to virtual mode
>> + "mtmsrd %[msr] ;"
>
> Pity we don't have something that can switch to the emergency stack so
> we can write this stuff in C.
>
> How's something like this suit you?
It looks like it would be really good for writing exploits :)
I think at the very least we would want the asm part to load the SP
from the paca itself, rather than taking it as a parameter.
But I'm not sure writing these types of things in C is a big win, because
you have to be so careful about what you call anyway. It's almost better
in asm because it's so restrictive.
Obviously having said that, my first attempt got the IRQ save/restore
wrong, so maybe we should at least have some macros to help with it.
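Something like this perhaps (totally hypothetical, untested):

	#define realmode_irq_save(flags)	\
	do {					\
		local_irq_save(flags);		\
		hard_irq_disable();		\
	} while (0)

	#define realmode_irq_restore(flags)	local_irq_restore(flags)

At least then the HARD_DIS dance would be written down once.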
Did you have another user for this in mind? The only one that I can
think of at the moment is the subcore stuff.
cheers
> diff --git a/arch/powerpc/kernel/misc_64.S b/arch/powerpc/kernel/misc_64.S
> index 070465825c21..5e911d0b0b16 100644
> --- a/arch/powerpc/kernel/misc_64.S
> +++ b/arch/powerpc/kernel/misc_64.S
> @@ -27,6 +27,28 @@
>
> .text
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> +_GLOBAL(__call_realmode)
> + mflr r0
> + std r0,16(r1)
> + stdu r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r5)
> + mr r1,r5
> + mtctr r3
> + mr r3,r4
> + mfmsr r4
> + xori r4,r4,(MSR_IR|MSR_DR)
> + mtmsrd r4
> + bctrl
> + mfmsr r4
> + xori r4,r4,(MSR_IR|MSR_DR)
> + mtmsrd r4
> + ld r1,0(r1)
> + ld r0,16(r1)
> + mtlr r0
> + blr
> +
> +#endif
> +
> _GLOBAL(call_do_softirq)
> mflr r0
> std r0,16(r1)
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index a66f435dabbf..260d60f665a3 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -2197,6 +2197,43 @@ void show_stack(struct task_struct *tsk, unsigned long *stack,
> put_task_stack(tsk);
> }
>
> +#ifdef CONFIG_PPC_BOOK3S_64
> +int __call_realmode(int (*fn)(void *arg), void *arg, void *sp);
> +
> +/* XXX: find a better place for this
> + * Executing C code in real-mode in general Book3S-64 code can only be done
> + * via this function that switches the stack to one inside the real-mode-area,
> + * which may cover only a small first part of real memory on hash guest LPARs.
> + * fn must be NOKPROBES, must not access vmalloc or anything outside the RMA,
> + * probably shouldn't enable the MMU or interrupts, etc, and be very careful
> + * about calling other generic kernel or powerpc functions.
> + */
> +int call_realmode(int (*fn)(void *arg), void *arg)
> +{
> + unsigned long flags;
> + void *cursp, *emsp;
> + int ret;
> +
> + /* Stack switch is only really required for HPT LPAR, but do it for all to help test coverage of tricky code */
> + cursp = (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
> + emsp = (void *)(local_paca->emergency_sp - THREAD_SIZE);
> +
> + /* XXX check_stack_overflow(); */
> +
> + if (WARN_ON_ONCE(cursp == emsp))
> + return -EBUSY;
> +
> + local_irq_save(flags);
> + hard_irq_disable();
> +
> + ret = __call_realmode(fn, arg, emsp);
> +
> + local_irq_restore(flags);
> +
> + return ret;
> +}
> +#endif
> +
> #ifdef CONFIG_PPC64
> /* Called with hard IRQs off */
> void notrace __ppc64_runlatch_on(void)
> --
> 2.23.0
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-03-16 6:40 ` Michael Ellerman
@ 2021-03-22 3:09 ` Nicholas Piggin
2021-03-22 9:07 ` Michael Ellerman
0 siblings, 1 reply; 22+ messages in thread
From: Nicholas Piggin @ 2021-03-22 3:09 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: aneesh.kumar
Excerpts from Michael Ellerman's message of March 16, 2021 4:40 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>>> failures when using the Hash MMU and running under phyp. The crashes
>>> are intermittent, and often exhibit as a completely unresponsive
>>> system, or possibly an oops.
> ...
>>>
>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> index 3663d3cdffac..01de985df2c4 100644
>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>> mmu_kernel_ssize);
>>> }
>>>
>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>> +{
>>> + unsigned long msr, tmp, flags;
>>> + int *p;
>>> +
>>> + p = &parms->cpu_counter.counter;
>>> +
>>> + local_irq_save(flags);
>>> + __hard_EE_RI_disable();
>>> +
>>> + asm volatile (
>>> + // Switch to real mode and leave interrupts off
>>> + "mfmsr %[msr] ;"
>>> + "li %[tmp], %[MSR_IR_DR] ;"
>>> + "andc %[tmp], %[msr], %[tmp] ;"
>>> + "mtmsrd %[tmp] ;"
>>> +
>>> + // Tell the master we are in real mode
>>> + "1: "
>>> + "lwarx %[tmp], 0, %[p] ;"
>>> + "addic %[tmp], %[tmp], -1 ;"
>>> + "stwcx. %[tmp], 0, %[p] ;"
>>> + "bne- 1b ;"
>>> +
>>> + // Spin until the counter goes to zero
>>> + "2: ;"
>>> + "lwz %[tmp], 0(%[p]) ;"
>>> + "cmpwi %[tmp], 0 ;"
>>> + "bne- 2b ;"
>>> +
>>> + // Switch back to virtual mode
>>> + "mtmsrd %[msr] ;"
>>
>> Pity we don't have something that can switch to the emergency stack so
>> we can write this stuff in C.
>>
>> How's something like this suit you?
>
> It looks like it would be really good for writing exploits :)
Hmm. In that case maybe the callee function could be inlined into it
like the interrupt wrappers, and the asm real-mode entry/exit gets
added around it rather than have this little exploit stub. So similar to
yours but with a stack switch as well so you can come back up in real
mode.
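Something like this shape (hypothetical, and eliding the stack switch,
which would still have to be asm):

	#define DEFINE_REALMODE_FN(fn)					\
	static __always_inline int ____##fn(void *arg);			\
	int notrace fn(void *arg)					\
	{								\
		unsigned long msr = mfmsr();				\
		int ret;						\
									\
		/* asm switch to the emergency stack goes here */	\
		__mtmsrd(msr & ~(MSR_IR | MSR_DR), 0);			\
		ret = ____##fn(arg);					\
		__mtmsrd(msr, 0);					\
		/* ... and switch back */				\
		return ret;						\
	}								\
	static __always_inline int ____##fn(void *arg)

That way the real-mode body is only ever inlined into its wrapper and
never exists as a separately callable function.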
> I think at the very least we would want the asm part to load the SP
> from the paca itself, rather than taking it as a parameter.
>
>> But I'm not sure writing these types of things in C is a big win, because
> you have to be so careful about what you call anyway. It's almost better
> in asm because it's so restrictive.
>
> Obviously having said that, my first attempt got the IRQ save/restore
> wrong, so maybe we should at least have some macros to help with it.
>
> Did you have another user for this in mind? The only one that I can
> think of at the moment is the subcore stuff.
Possibly rtas entry/exit (although that has other issues). But I guess
it's not a huge amount of asm compared with what I'm dealing with.
I'm okay if you just put your thing in at the moment, we might or might
not get keen and c-ify it later.
Thanks,
Nick
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-03-22 3:09 ` Nicholas Piggin
@ 2021-03-22 9:07 ` Michael Ellerman
0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-03-22 9:07 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: aneesh.kumar
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Michael Ellerman's message of March 16, 2021 4:40 pm:
>> Nicholas Piggin <npiggin@gmail.com> writes:
>>> Excerpts from Michael Ellerman's message of February 11, 2021 11:51 pm:
>>>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>>>> failures when using the Hash MMU and running under phyp. The crashes
>>>> are intermittent, and often exhibit as a completely unresponsive
>>>> system, or possibly an oops.
>> ...
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> index 3663d3cdffac..01de985df2c4 100644
>>>> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
>>>> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
>>>> mmu_kernel_ssize);
>>>> }
>>>>
>>>> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
>>>> +{
>>>> + unsigned long msr, tmp, flags;
>>>> + int *p;
>>>> +
>>>> + p = &parms->cpu_counter.counter;
>>>> +
>>>> + local_irq_save(flags);
>>>> + __hard_EE_RI_disable();
>>>> +
>>>> + asm volatile (
>>>> + // Switch to real mode and leave interrupts off
>>>> + "mfmsr %[msr] ;"
>>>> + "li %[tmp], %[MSR_IR_DR] ;"
>>>> + "andc %[tmp], %[msr], %[tmp] ;"
>>>> + "mtmsrd %[tmp] ;"
>>>> +
>>>> + // Tell the master we are in real mode
>>>> + "1: "
>>>> + "lwarx %[tmp], 0, %[p] ;"
>>>> + "addic %[tmp], %[tmp], -1 ;"
>>>> + "stwcx. %[tmp], 0, %[p] ;"
>>>> + "bne- 1b ;"
>>>> +
>>>> + // Spin until the counter goes to zero
>>>> + "2: ;"
>>>> + "lwz %[tmp], 0(%[p]) ;"
>>>> + "cmpwi %[tmp], 0 ;"
>>>> + "bne- 2b ;"
>>>> +
>>>> + // Switch back to virtual mode
>>>> + "mtmsrd %[msr] ;"
>>>
>>> Pity we don't have something that can switch to the emergency stack so
>>> we can write this stuff in C.
>>>
>>> How's something like this suit you?
>>
>> It looks like it would be really good for writing exploits :)
>
> Hmm. In that case maybe the callee function could be inlined into it
> like the interrupt wrappers, and the asm real-mode entry/exit gets
> added around it rather than have this little exploit stub. So similar to
> yours but with a stack switch as well so you can come back up in real
> mode.
Yeah inlining as much as possible would reduce the risk.
>> I think at the very least we would want the asm part to load the SP
>> from the paca itself, rather than taking it as a parameter.
>>
>> But I'm not sure writing these types of things in C is a big win, because
>> you have to be so careful about what you call anyway. It's almost better
>> in asm because it's so restrictive.
>>
>> Obviously having said that, my first attempt got the IRQ save/restore
>> wrong, so maybe we should at least have some macros to help with it.
>>
>> Did you have another user for this in mind? The only one that I can
>> think of at the moment is the subcore stuff.
>
> Possibly rtas entry/exit (although that has other issues). But I guess
> it's not a huge amount of asm compared with what I'm dealing with.
Ah yep, I hadn't thought of RTAS.
> I'm okay if you just put your thing in at the moment, we might or might
> not get keen and c-ify it later.
OK.
cheers
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-11 13:51 ` [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR Michael Ellerman
2021-02-11 23:16 ` Nicholas Piggin
2021-02-12 0:36 ` Nicholas Piggin
@ 2021-02-19 2:43 ` Daniel Axtens
2021-03-19 11:56 ` Michael Ellerman
2 siblings, 1 reply; 22+ messages in thread
From: Daniel Axtens @ 2021-02-19 2:43 UTC (permalink / raw)
To: Michael Ellerman, linuxppc-dev; +Cc: aneesh.kumar
Michael Ellerman <mpe@ellerman.id.au> writes:
> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
>
> One example, which was caught in xmon:
>
> [ 14.068327][ T1] devtmpfs: mounted
> [ 14.069302][ T1] Freeing unused kernel memory: 5568K
> [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
> [ 14.142063][ T1] Run /sbin/init as init process
> [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
> cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
> pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
> lr: c0000000001862d4: update_rq_clock+0x44/0x110
> sp: c00000000c747880
> msr: 8000000040001031
> current = 0xc00000000c60d380
> paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
> pid = 347, comm = kworker/2:1
> ...
> enter ? for help
> [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
> [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
> [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
> [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
> [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
> [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
> [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
> [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
> [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
> [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
> [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
>
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we lay out text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
>
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.
Jordan asked why we saw this on phyp but not under KVM. We had a look at
book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
about!
Nick suggests that the KVM hypervisor is invalidating the HPTE, but
because we run guests in VPM mode, the hypervisor would catch the page
fault and not reflect it down to the guest. It looks like Linux-as-a-HV
will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
hypervisor will catch the fault and not pass it to the guest. But if
phyp runs with VPM mode off, the guest will see the fault before the
hypervisor. (we think this is what's going on anyway.)
We spent a while pondering if phyp is doing something buggy or not...
Looking at the PAPR definition of H_PROTECT, it claims the hypervisor
will do the 'architected “Modifying a Page Table Entry General Case”
sequence'. Section 5.10.1.2 of Book III-S of the ISAv3 defines that, and
the non-atomic hardware sequence does indeed modify the PTE by going
through the invalid state. So it looks like if phyp is running without
VPM mode it's technically not buggy.
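To spell it out, the general case sequence is roughly this (paraphrasing
the ISA, not the actual phyp code; new_pp stands in for the new
permission bits):

	hpte->v &= ~HPTE_V_VALID;	/* 1. invalidate the entry */
	/* ptesync; tlbie; eieio; tlbsync; ptesync */
	hpte->r = new_pp;		/* 2. install the new permissions */
	/* eieio */
	hpte->v |= HPTE_V_VALID;	/* 3. make it valid again */
	/* ptesync */

So between steps 1 and 3 the translation simply doesn't exist, and with
VPM off any access through it is reflected straight to the guest.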
Hopefully I'll get to have a look at the rest of the patch shortly!
Kind regards,
Daniel
> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
> arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++++++++++++++++++++++-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
> #include <linux/sched.h>
> #include <linux/mm_types.h>
> #include <linux/mm.h>
> +#include <linux/stop_machine.h>
>
> #include <asm/sections.h>
> #include <asm/mmu.h>
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> + unsigned long start, end, newpp;
> + unsigned int step, nr_cpus, master_cpu;
> + atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
> static void change_memory_range(unsigned long start, unsigned long end,
> unsigned int step, unsigned long newpp)
> {
> @@ -414,6 +428,73 @@ static void change_memory_range(unsigned long start, unsigned long end,
> mmu_kernel_ssize);
> }
>
> +static int notrace chmem_secondary_loop(struct change_memory_parms *parms)
> +{
> + unsigned long msr, tmp, flags;
> + int *p;
> +
> + p = &parms->cpu_counter.counter;
> +
> + local_irq_save(flags);
> + __hard_EE_RI_disable();
> +
> + asm volatile (
> + // Switch to real mode and leave interrupts off
> + "mfmsr %[msr] ;"
> + "li %[tmp], %[MSR_IR_DR] ;"
> + "andc %[tmp], %[msr], %[tmp] ;"
> + "mtmsrd %[tmp] ;"
> +
> + // Tell the master we are in real mode
> + "1: "
> + "lwarx %[tmp], 0, %[p] ;"
> + "addic %[tmp], %[tmp], -1 ;"
> + "stwcx. %[tmp], 0, %[p] ;"
> + "bne- 1b ;"
> +
> + // Spin until the counter goes to zero
> + "2: ;"
> + "lwz %[tmp], 0(%[p]) ;"
> + "cmpwi %[tmp], 0 ;"
> + "bne- 2b ;"
> +
> + // Switch back to virtual mode
> + "mtmsrd %[msr] ;"
> +
> + : // outputs
> + [msr] "=&r" (msr), [tmp] "=&b" (tmp), "+m" (*p)
> + : // inputs
> + [p] "b" (p), [MSR_IR_DR] "i" (MSR_IR | MSR_DR)
> + : // clobbers
> + "cc", "xer"
> + );
> +
> + local_irq_restore(flags);
> +
> + return 0;
> +}
> +
> +static int change_memory_range_fn(void *data)
> +{
> + struct change_memory_parms *parms = data;
> +
> + if (parms->master_cpu != smp_processor_id())
> + return chmem_secondary_loop(parms);
> +
> + // Wait for all but one CPU (this one) to call-in
> + while (atomic_read(&parms->cpu_counter) > 1)
> + barrier();
> +
> + change_memory_range(parms->start, parms->end, parms->step, parms->newpp);
> +
> + mb();
> +
> + // Signal the other CPUs that we're done
> + atomic_dec(&parms->cpu_counter);
> +
> + return 0;
> +}
> +
> static bool hash__change_memory_range(unsigned long start, unsigned long end,
> unsigned long newpp)
> {
> @@ -428,7 +509,29 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
> if (start >= end)
> return false;
>
> - change_memory_range(start, end, step, newpp);
> + if (firmware_has_feature(FW_FEATURE_LPAR)) {
> + mutex_lock(&chmem_lock);
> +
> + chmem_parms.start = start;
> + chmem_parms.end = end;
> + chmem_parms.step = step;
> + chmem_parms.newpp = newpp;
> + chmem_parms.master_cpu = smp_processor_id();
> +
> + cpus_read_lock();
> +
> + atomic_set(&chmem_parms.cpu_counter, num_online_cpus());
> +
> + // Ensure state is consistent before we call the other CPUs
> + mb();
> +
> + stop_machine_cpuslocked(change_memory_range_fn, &chmem_parms,
> + cpu_online_mask);
> +
> + cpus_read_unlock();
> + mutex_unlock(&chmem_lock);
> + } else
> + change_memory_range(start, end, step, newpp);
>
> return true;
> }
> --
> 2.25.1
* Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR
2021-02-19 2:43 ` Daniel Axtens
@ 2021-03-19 11:56 ` Michael Ellerman
0 siblings, 0 replies; 22+ messages in thread
From: Michael Ellerman @ 2021-03-19 11:56 UTC (permalink / raw)
To: Daniel Axtens, linuxppc-dev; +Cc: aneesh.kumar
Daniel Axtens <dja@axtens.net> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>
>> When we enabled STRICT_KERNEL_RWX we received some reports of boot
>> failures when using the Hash MMU and running under phyp. The crashes
>> are intermittent, and often exhibit as a completely unresponsive
>> system, or possibly an oops.
>>
>> One example, which was caught in xmon:
>>
>> [ 14.068327][ T1] devtmpfs: mounted
>> [ 14.069302][ T1] Freeing unused kernel memory: 5568K
>> [ 14.142060][ T347] BUG: Unable to handle kernel instruction fetch
>> [ 14.142063][ T1] Run /sbin/init as init process
>> [ 14.142074][ T347] Faulting instruction address: 0xc000000000004400
>> cpu 0x2: Vector: 400 (Instruction Access) at [c00000000c7475e0]
>> pc: c000000000004400: exc_virt_0x4400_instruction_access+0x0/0x80
>> lr: c0000000001862d4: update_rq_clock+0x44/0x110
>> sp: c00000000c747880
>> msr: 8000000040001031
>> current = 0xc00000000c60d380
>> paca = 0xc00000001ec9de80 irqmask: 0x03 irq_happened: 0x01
>> pid = 347, comm = kworker/2:1
>> ...
>> enter ? for help
>> [c00000000c747880] c0000000001862d4 update_rq_clock+0x44/0x110 (unreliable)
>> [c00000000c7478f0] c000000000198794 update_blocked_averages+0xb4/0x6d0
>> [c00000000c7479f0] c000000000198e40 update_nohz_stats+0x90/0xd0
>> [c00000000c747a20] c0000000001a13b4 _nohz_idle_balance+0x164/0x390
>> [c00000000c747b10] c0000000001a1af8 newidle_balance+0x478/0x610
>> [c00000000c747be0] c0000000001a1d48 pick_next_task_fair+0x58/0x480
>> [c00000000c747c40] c000000000eaab5c __schedule+0x12c/0x950
>> [c00000000c747cd0] c000000000eab3e8 schedule+0x68/0x120
>> [c00000000c747d00] c00000000016b730 worker_thread+0x130/0x640
>> [c00000000c747da0] c000000000174d50 kthread+0x1a0/0x1b0
>> [c00000000c747e10] c00000000000e0f0 ret_from_kernel_thread+0x5c/0x6c
>>
>> This shows that CPU 2, which was idle, woke up and then appears to
>> randomly take an instruction fault on a completely valid area of
>> kernel text.
>>
>> The cause turns out to be the call to hash__mark_rodata_ro(), late in
>> boot. Due to the way we lay out text and rodata, that function actually
>> changes the permissions for all of text and rodata to read-only plus
>> execute.
>>
>> To do the permission change we use a hypervisor call, H_PROTECT. On
>> phyp that appears to be implemented by briefly removing the mapping of
>> the kernel text, before putting it back with the updated permissions.
>> If any other CPU is executing during that window, it will see spurious
>> faults on the kernel text and/or data, leading to crashes.
>
> Jordan asked why we saw this on phyp but not under KVM. We had a look at
> book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
> about!
>
> Nick suggests that the KVM hypervisor is invalidating the HPTE, but
> because we run guests in VPM mode, the hypervisor would catch the page
> fault and not reflect it down to the guest. It looks like Linux-as-a-HV
> will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
> hypervisor will catch the fault and not pass it to the guest.
Yep.
> But if phyp runs with VPM mode off, the guest will see the fault
> before the hypervisor. (we think this is what's going on anyway.)
Yeah. I assumed phyp always ran with VPM=1, but apparently it can run
with it off or on, depending on various configuration settings.
So I'm fairly sure what we're hitting here is VPM=0, where the faults go
straight to the guest.
cheers