* [patch 0/2] x86/mm/pti: Robustness updates
@ 2019-08-28 14:24 Thomas Gleixner
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 14:24 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

Following up on the discussions around the patch Song submitted to 'cure' an
iTLB related performance regression, I picked up Song's patch which makes
pti_clone_pgtable() more robust by handling unaligned addresses properly and
added one which prevents calling into the PTI code when PTI is enabled at
compile time, but disabled at runtime (command line or CPU not affected).

There is no point in calling into those PTI functions unconditionally. The
resulting page tables are the same before and after the change which makes
sense as the code clones the kernel page table into the secondary page
table space which is available but not used when PTI is boot time disabled.

But even if it does not do damage today, this could have nasty side effects
when the PTI code is changed, extended etc. later.

Thanks,

	tglx
8<----------------
 pti.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)




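For context on the 'secondary page table space' mentioned above: with PTI built
in, each PGD is an order-1 (8 KB) allocation, with the kernel page table in the
first 4 KB page and the user-visible copy in the second. A stand-alone
user-space sketch of that address arithmetic (a simplified model of the
kernel's kernel_to_user_pgdp(); the example address is made up, not kernel
code):

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/*
 * Simplified model: with PTI the PGD is an order-1 allocation and the
 * user copy lives in the second 4 KB page, so setting bit PAGE_SHIFT
 * of the PGD pointer switches from the kernel half to the user half.
 */
static unsigned long kernel_to_user_pgdp(unsigned long kernel_pgdp)
{
	return kernel_pgdp | PAGE_SIZE;
}

int main(void)
{
	/* made-up, 8 KB aligned example address */
	unsigned long kernel_pgd = 0xffff888003a0c000UL;

	printf("kernel pgd: 0x%lx\n", kernel_pgd);
	printf("user   pgd: 0x%lx\n", kernel_to_user_pgdp(kernel_pgd));
	return 0;
}

When PTI is boot-time disabled that second half is still allocated but never
switched to, which is why cloning into it is harmless yet pointless, as the
cover letter notes.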

* [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 14:24 [patch 0/2] x86/mm/pti: Robustness updates Thomas Gleixner
@ 2019-08-28 14:24 ` Thomas Gleixner
  2019-08-28 15:46   ` Dave Hansen
                     ` (2 more replies)
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
  2019-08-28 16:03 ` [patch 0/2] x86/mm/pti: Robustness updates Peter Zijlstra
  2 siblings, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 14:24 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

From: Song Liu <songliubraving@fb.com>

pti_clone_pmds() assumes that the supplied address is either:

 - properly PUD/PMD aligned
or
 - the address is actually mapped which means that independent
   of the mapping level (PUD/PMD/PTE) the next higher mapping
   exist.

If that's not the case the unaligned address can be incremented by PUD or
PMD size wrongly. All callers supply mapped and/or aligned addresses, but
for robustness sake, it's better to handle that case proper and to emit a
warning.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190820202314.1083149-1-songliubraving@fb.com

---
 arch/x86/mm/pti.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			WARN_ON_ONCE(addr & PUD_MASK);
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			WARN_ON_ONCE(addr & PMD_MASK);
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
 



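To see what the round_up() in the hunk above buys over the old
'addr += PMD_SIZE', consider this stand-alone user-space sketch (not kernel
code; 2 MB PMD size assumed, round_up() reimplemented for power-of-two
alignments):

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)	/* 2 MB */
/* same result as the kernel's round_up() for power-of-two alignments */
#define round_up(x, y)	(((x) + (y) - 1) & ~((y) - 1))

int main(void)
{
	unsigned long aligned   = 0xffffffff81200000UL;	/* PMD aligned */
	unsigned long unaligned = 0xffffffff81234567UL;

	/* Aligned start: both variants advance by exactly one PMD. */
	printf("0x%lx 0x%lx\n",
	       aligned + PMD_SIZE, round_up(aligned + 1, PMD_SIZE));

	/*
	 * Unaligned start: 'addr += PMD_SIZE' lands on another unaligned
	 * address and can step past the start of the next mapping, while
	 * round_up(addr + 1, PMD_SIZE) snaps to the next PMD boundary.
	 */
	printf("0x%lx 0x%lx\n",
	       unaligned + PMD_SIZE, round_up(unaligned + 1, PMD_SIZE));
	return 0;
}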

* [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 [patch 0/2] x86/mm/pti: Robustness updates Thomas Gleixner
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
@ 2019-08-28 14:24 ` Thomas Gleixner
  2019-08-28 15:47   ` Dave Hansen
                     ` (4 more replies)
  2019-08-28 16:03 ` [patch 0/2] x86/mm/pti: Robustness updates Peter Zijlstra
  2 siblings, 5 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 14:24 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

When PTI is disabled at boot time either because the CPU is not affected or
PTI has been disabled on the command line, the boot code still calls into
pti_finalize() which then unconditionally invokes:

     pti_clone_entry_text()
     pti_clone_kernel_text()

pti_clone_kernel_text() was called unconditionally before the 32bit support
was added and 32bit added the call to pti_clone_entry_text().

The call has no side effects as cloning the page tables into the available
second one, which was allocated for PTI does not create damage. But it does
not make sense either and in case that this functionality would be extended
later this might actually lead to hard to diagnose issue.

Neither function should be called when PTI is runtime disabled. Make the
invocation conditional.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/pti.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -668,6 +668,8 @@ void __init pti_init(void)
  */
 void pti_finalize(void)
 {
+	if (!boot_cpu_has(X86_FEATURE_PTI))
+		return;
 	/*
 	 * We need to clone everything (again) that maps parts of the
 	 * kernel image.




* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
@ 2019-08-28 15:46   ` Dave Hansen
  2019-08-28 15:51     ` Thomas Gleixner
  2019-08-28 18:58   ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Ingo Molnar
  2019-08-28 20:54   ` [patch V2 " Thomas Gleixner
  2 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2019-08-28 15:46 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Song Liu, Joerg Roedel, Andy Lutomirski, Peter Zijlstra,
	Rik van Riel

On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> From: Song Liu <songliubraving@fb.com>
> 
> pti_clone_pmds() assumes that the supplied address is either:
> 
>  - properly PUD/PMD aligned
> or
>  - the address is actually mapped which means that independent
>    of the mapping level (PUD/PMD/PTE) the next higher mapping
>    exist.
> 
> If that's not the case the unaligned address can be incremented by PUD or
> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> for robustness sake, it's better to handle that case proper and to emit a
> warning.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>

Song, did you ever root-cause the performance regression?  I thought
there were still some mysteries there.


* Re: [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
@ 2019-08-28 15:47   ` Dave Hansen
  2019-08-28 17:49   ` Song Liu
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2019-08-28 15:47 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> Neither function should be called when PTI is runtime disabled. Make the
> invocation conditional.

Thanks for sending that out.  My impressions from a look through it
matched your changelog.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 15:46   ` Dave Hansen
@ 2019-08-28 15:51     ` Thomas Gleixner
  2019-08-28 17:58       ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 15:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, x86, Song Liu, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, 28 Aug 2019, Dave Hansen wrote:
> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> > From: Song Liu <songliubraving@fb.com>
> > 
> > pti_clone_pmds() assumes that the supplied address is either:
> > 
> >  - properly PUD/PMD aligned
> > or
> >  - the address is actually mapped which means that independent
> >    of the mapping level (PUD/PMD/PTE) the next higher mapping
> >    exist.
> > 
> > If that's not the case the unaligned address can be incremented by PUD or
> > PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> > for robustness sake, it's better to handle that case proper and to emit a
> > warning.
> 
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Song, did you ever root-cause the performance regression?  I thought
> there were still some mysteries there.

See Peter's series to rework the ftrace code patching ...



* Re: [patch 0/2] x86/mm/pti: Robustness updates
  2019-08-28 14:24 [patch 0/2] x86/mm/pti: Robustness updates Thomas Gleixner
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
@ 2019-08-28 16:03 ` Peter Zijlstra
  2 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-08-28 16:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Rik van Riel

On Wed, Aug 28, 2019 at 04:24:45PM +0200, Thomas Gleixner wrote:
> Following up on the discussions around the patch Song submitted to 'cure' an
> iTLB related performance regression, I picked up Song's patch which makes
> pti_clone_pgtable() more robust by handling unaligned addresses properly and
> added one which prevents calling into the PTI code when PTI is enabled at
> compile time, but disabled at runtime (command line or CPU not affected).
> 
> There is no point in calling into those PTI functions unconditionally. The
> resulting page tables are the same before and after the change which makes
> sense as the code clones the kernel page table into the secondary page
> table space which is available but not used when PTI is boot time disabled.
> 
> But even if it does not do damage today, this could have nasty side effects
> when the PTI code is changed, extended etc. later.
> 

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
  2019-08-28 15:47   ` Dave Hansen
@ 2019-08-28 17:49   ` Song Liu
  2019-08-28 19:00   ` Ingo Molnar
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-08-28 17:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel



> On Aug 28, 2019, at 7:24 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> When PTI is disabled at boot time either because the CPU is not affected or
> PTI has been disabled on the command line, the boot code still calls into
> pti_finalize() which then unconditionally invokes:
> 
>     pti_clone_entry_text()
>     pti_clone_kernel_text()
> 
> pti_clone_kernel_text() was called unconditionally before the 32bit support
> was added and 32bit added the call to pti_clone_entry_text().
> 
> The call has no side effects as cloning the page tables into the available
> second one, which was allocated for PTI does not create damage. But it does
> not make sense either and in case that this functionality would be extended
> later this might actually lead to hard to diagnose issue.
> 
> Neither function should be called when PTI is runtime disabled. Make the
> invocation conditional.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Acked-by: Song Liu <songliubraving@fb.com>



* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 15:51     ` Thomas Gleixner
@ 2019-08-28 17:58       ` Song Liu
  2019-08-28 20:05         ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-08-28 17:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel



> On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, 28 Aug 2019, Dave Hansen wrote:
>> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
>>> From: Song Liu <songliubraving@fb.com>
>>> 
>>> pti_clone_pmds() assumes that the supplied address is either:
>>> 
>>> - properly PUD/PMD aligned
>>> or
>>> - the address is actually mapped which means that independent
>>>   of the mapping level (PUD/PMD/PTE) the next higher mapping
>>>   exist.
>>> 
>>> If that's not the case the unaligned address can be incremented by PUD or
>>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
>>> for robustness sake, it's better to handle that case proper and to emit a
>>> warning.
>> 
>> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>> 
>> Song, did you ever root-cause the performance regression?  I thought
>> there were still some mysteries there.
> 
> See Peter's series to rework the ftrace code patching ...

Thanks Thomas. 

Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
to split PMDs in kernel text mapping. 

Related question: while Peter's patches fix it for 5.3 kernel, they don't 
apply cleanly over 5.2 kernel (which we are using). So I wonder what is
the best solution for 5.2 kernel. My patch also fixes the issue:

https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/

How about we apply this patch to upstream 5.2 kernel?

Thanks,
Song


* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
  2019-08-28 15:46   ` Dave Hansen
@ 2019-08-28 18:58   ` Ingo Molnar
  2019-08-28 19:45     ` Thomas Gleixner
  2019-08-28 20:54   ` [patch V2 " Thomas Gleixner
  2 siblings, 1 reply; 26+ messages in thread
From: Ingo Molnar @ 2019-08-28 18:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel


* Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Song Liu <songliubraving@fb.com>
> 
> pti_clone_pmds() assumes that the supplied address is either:
> 
>  - properly PUD/PMD aligned
> or
>  - the address is actually mapped which means that independent
>    of the mapping level (PUD/PMD/PTE) the next higher mapping
>    exist.

s/independent
 /independently

s/exist
 /exists

> If that's not the case the unaligned address can be incremented by PUD or
> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> for robustness sake, it's better to handle that case proper and to emit a
> warning.

s/wrongly
 /incorrectly

s/robustness sake
 /robustness's sake

s/proper
 /properly

With that:

>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			WARN_ON_ONCE(addr & PUD_MASK);
> +			addr = round_up(addr + 1, PUD_SIZE);
>  			continue;
>  		}
>  
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> -			addr += PMD_SIZE;
> +			WARN_ON_ONCE(addr & PMD_MASK);
> +			addr = round_up(addr + 1, PMD_SIZE);

So given that PUD_MASK and PMD_MASK are masking out the *offset*:

 arch/x86/include/asm/pgtable_64_types.h:#define PMD_MASK	(~(PMD_SIZE - 1))

Didn't we want something like:

			WARN_ON_ONCE(addr & ~PUD_MASK);

			WARN_ON_ONCE(addr & ~PMD_MASK);

to warn about an unaligned 'addr', or am I misreading the intent here?

Thanks,

	Ingo

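The mask semantics Ingo points out above can be checked with a stand-alone
user-space sketch (not kernel code; 2 MB PMD size assumed):

#include <stdio.h>

#define PMD_SIZE	(1UL << 21)		/* 2 MB */
#define PMD_MASK	(~(PMD_SIZE - 1))	/* clears the offset inside a PMD */

int main(void)
{
	unsigned long aligned   = 0xffffffff81200000UL;	/* PMD aligned */
	unsigned long unaligned = 0xffffffff81200123UL;

	/*
	 * addr & PMD_MASK keeps the PMD base address, so it is non-zero
	 * even for a perfectly aligned kernel address and would warn
	 * every time.
	 */
	printf("0x%lx 0x%lx\n", aligned & PMD_MASK, unaligned & PMD_MASK);

	/*
	 * addr & ~PMD_MASK isolates the offset and is zero exactly when
	 * the address is PMD aligned, which is the intended check.
	 */
	printf("0x%lx 0x%lx\n", aligned & ~PMD_MASK, unaligned & ~PMD_MASK);
	return 0;
}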

* Re: [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
  2019-08-28 15:47   ` Dave Hansen
  2019-08-28 17:49   ` Song Liu
@ 2019-08-28 19:00   ` Ingo Molnar
  2019-08-29 19:02   ` [tip: x86/pti] " tip-bot2 for Thomas Gleixner
  2019-08-30 10:25   ` [patch 2/2] " Joerg Roedel
  4 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2019-08-28 19:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel


* Thomas Gleixner <tglx@linutronix.de> wrote:

> When PTI is disabled at boot time either because the CPU is not affected or
> PTI has been disabled on the command line, the boot code still calls into
> pti_finalize() which then unconditionally invokes:
> 
>      pti_clone_entry_text()
>      pti_clone_kernel_text()
> 
> pti_clone_kernel_text() was called unconditionally before the 32bit support
> was added and 32bit added the call to pti_clone_entry_text().
> 
> The call has no side effects as cloning the page tables into the available
> second one, which was allocated for PTI does not create damage. But it does
> not make sense either and in case that this functionality would be extended
> later this might actually lead to hard to diagnose issue.

s/issue/issues

> Neither function should be called when PTI is runtime disabled. Make the
> invocation conditional.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/mm/pti.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -668,6 +668,8 @@ void __init pti_init(void)
>   */
>  void pti_finalize(void)
>  {
> +	if (!boot_cpu_has(X86_FEATURE_PTI))
> +		return;
>  	/*
>  	 * We need to clone everything (again) that maps parts of the
>  	 * kernel image.

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 18:58   ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Ingo Molnar
@ 2019-08-28 19:45     ` Thomas Gleixner
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 19:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, 28 Aug 2019, Ingo Molnar wrote:
> >  		pmd = pmd_offset(pud, addr);
> >  		if (pmd_none(*pmd)) {
> > -			addr += PMD_SIZE;
> > +			WARN_ON_ONCE(addr & PMD_MASK);
> > +			addr = round_up(addr + 1, PMD_SIZE);
> 
> So given that PUD_MASK and PMD_MASK are masking out the *offset*:
> 
>  arch/x86/include/asm/pgtable_64_types.h:#define PMD_MASK	(~(PMD_SIZE - 1))
> 
> Didn't we want something like:
> 
> 			WARN_ON_ONCE(addr & ~PUD_MASK);
> 
> 			WARN_ON_ONCE(addr & ~PMD_MASK);
> 
> to warn about an unaligned 'addr', or am I misreading the intent here?

Bah, right you are...


* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 17:58       ` Song Liu
@ 2019-08-28 20:05         ` Thomas Gleixner
  2019-08-28 20:32           ` Song Liu
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 20:05 UTC (permalink / raw)
  To: Song Liu
  Cc: Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, 28 Aug 2019, Song Liu wrote:
> > On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Wed, 28 Aug 2019, Dave Hansen wrote:
> >> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> >>> From: Song Liu <songliubraving@fb.com>
> >>> 
> >>> pti_clone_pmds() assumes that the supplied address is either:
> >>> 
> >>> - properly PUD/PMD aligned
> >>> or
> >>> - the address is actually mapped which means that independent
> >>>   of the mapping level (PUD/PMD/PTE) the next higher mapping
> >>>   exist.
> >>> 
> >>> If that's not the case the unaligned address can be incremented by PUD or
> >>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> >>> for robustness sake, it's better to handle that case proper and to emit a
> >>> warning.
> >> 
> >> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> >> 
> >> Song, did you ever root-cause the performance regression?  I thought
> >> there were still some mysteries there.
> > 
> > See Peter's series to rework the ftrace code patching ...
> 
> Thanks Thomas. 
> 
> Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
> to split PMDs in kernel text mapping. 
> 
> Related question: while Peter's patches fix it for 5.3 kernel, they don't 
> apply cleanly over 5.2 kernel (which we are using). So I wonder what is
> the best solution for 5.2 kernel. My patch also fixes the issue:
> 
> https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/
> 
> How about we apply this patch to upstream 5.2 kernel?

That's not how it works. We fix stuff upstream and it gets backported to
all affected kernels not just to the one you care about.

Aside of that I really disagree with that hack. You completely fail to
explain why that commit in question broke it and instead of fixing the
underlying issue you create a horrible workaround.

It took me ~10 minutes to analyze the root cause and I'm just booting the
test box with a proper fix which can be actually tagged for stable and can
be removed from upstream again once ftrace got moved over to text poke.

I'll post it once it's confirmed to work and I wrote a comprehensible
changelog.

Thanks,

	tglx






* Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 20:05         ` Thomas Gleixner
@ 2019-08-28 20:32           ` Song Liu
  2019-08-28 22:31             ` [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Song Liu @ 2019-08-28 20:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel



> On Aug 28, 2019, at 1:05 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, 28 Aug 2019, Song Liu wrote:
>>> On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Wed, 28 Aug 2019, Dave Hansen wrote:
>>>> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
>>>>> From: Song Liu <songliubraving@fb.com>
>>>>> 
>>>>> pti_clone_pmds() assumes that the supplied address is either:
>>>>> 
>>>>> - properly PUD/PMD aligned
>>>>> or
>>>>> - the address is actually mapped which means that independent
>>>>>  of the mapping level (PUD/PMD/PTE) the next higher mapping
>>>>>  exist.
>>>>> 
>>>>> If that's not the case the unaligned address can be incremented by PUD or
>>>>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
>>>>> for robustness sake, it's better to handle that case proper and to emit a
>>>>> warning.
>>>> 
>>>> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
>>>> 
>>>> Song, did you ever root-cause the performance regression?  I thought
>>>> there were still some mysteries there.
>>> 
>>> See Peter's series to rework the ftrace code patching ...
>> 
>> Thanks Thomas. 
>> 
>> Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
>> to split PMDs in kernel text mapping. 
>> 
>> Related question: while Peter's patches fix it for 5.3 kernel, they don't 
>> apply cleanly over 5.2 kernel (which we are using). So I wonder what is
>> the best solution for 5.2 kernel. My patch also fixes the issue:
>> 
>> https://lore.kernel.org/lkml/20190823052335.572133-1-songliubraving@fb.com/
>> 
>> How about we apply this patch to upstream 5.2 kernel?
> 
> That's not how it works. We fix stuff upstream and it gets backported to
> all affected kernels not just to the one you care about.

Agreed. I am trying to backport Peter's patch set to the 5.2 kernel. There
are 9 dependencies and some manual changes.

> 
> Aside of that I really disagree with that hack. You completely fail to
> explain why that commit in question broke it and instead of fixing the
> underlying issue you create a horrible workaround.
> 
> It took me ~10 minutes to analyze the root cause and I'm just booting the
> test box with a proper fix which can be actually tagged for stable and can
> be removed from upstream again once ftrace got moved over to text poke.
> 
> I'll post it once it's confirmed to work and I wrote a comprehensible
> changelog.

This sounds great. Thanks!

Song



* [patch V2 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
  2019-08-28 15:46   ` Dave Hansen
  2019-08-28 18:58   ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Ingo Molnar
@ 2019-08-28 20:54   ` Thomas Gleixner
  2019-08-28 21:52     ` Thomas Gleixner
  2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
  2 siblings, 2 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 20:54 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

From: Song Liu <songliubraving@fb.com>

pti_clone_pmds() assumes that the supplied address is either:

 - properly PUD/PMD aligned
or
 - the address is actually mapped which means that independent
   of the mapping level (PUD/PMD/PTE) the next higher mapping
   exist.

If that's not the case the unaligned address can be incremented by PUD or
PMD size wrongly. All callers supply mapped and/or aligned addresses, but
for robustness sake, it's better to handle that case proper and to emit a
warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Negate P[UM]D_MASK for checking whether the offset part is 0
---
 arch/x86/mm/pti.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			WARN_ON_ONCE(addr & ~PUD_MASK);
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			WARN_ON_ONCE(addr & ~PMD_MASK);
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
 


* Re: [patch V2 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 20:54   ` [patch V2 " Thomas Gleixner
@ 2019-08-28 21:52     ` Thomas Gleixner
  2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
  1 sibling, 0 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 21:52 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, 28 Aug 2019, Thomas Gleixner wrote:

> From: Song Liu <songliubraving@fb.com>
> 
> pti_clone_pmds() assumes that the supplied address is either:
> 
>  - properly PUD/PMD aligned
> or
>  - the address is actually mapped which means that independent

Bah. I'm a moron. Forgot to fix the spell checker issues.


* [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 20:54   ` [patch V2 " Thomas Gleixner
  2019-08-28 21:52     ` Thomas Gleixner
@ 2019-08-28 21:54     ` Thomas Gleixner
  2019-08-28 23:22       ` Ingo Molnar
                         ` (2 more replies)
  1 sibling, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 21:54 UTC (permalink / raw)
  To: LKML
  Cc: x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

From: Song Liu <songliubraving@fb.com>

pti_clone_pmds() assumes that the supplied address is either:

 - properly PUD/PMD aligned
or
 - the address is actually mapped which means that independently
   of the mapping level (PUD/PMD/PTE) the next higher mapping
   exists.

If that's not the case the unaligned address can be incremented by PUD or
PMD size incorrectly. All callers supply mapped and/or aligned addresses,
but for the sake of robustness it's better to handle that case properly and
to emit a warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V2: Negate P[UM]D_MASK for checking whether the offset part is 0
V3: Fix changelog
---
 arch/x86/mm/pti.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			WARN_ON_ONCE(addr & ~PUD_MASK);
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			WARN_ON_ONCE(addr & ~PMD_MASK);
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
 


* [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
  2019-08-28 20:32           ` Song Liu
@ 2019-08-28 22:31             ` Thomas Gleixner
  2019-08-28 23:03               ` Song Liu
                                 ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Thomas Gleixner @ 2019-08-28 22:31 UTC (permalink / raw)
  To: Song Liu
  Cc: Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel, Steven Rostedt

ftrace does not use text_poke() for enabling trace functionality. It uses
its own mechanism and flips the whole kernel text to RW and back to RO.

The CPA rework removed a loop based check of 4k pages which tried to
preserve a large page by checking each 4k page whether the change would
actually cover all pages in the large page.

This resulted in endless loops for nothing as in testing it turned out that
it actually never preserved anything. Of course testing missed to include
ftrace, which is the one and only case which benefitted from the 4k loop.

As a consequence enabling function tracing or ftrace based kprobes results
in a full 4k split of the kernel text, which affects iTLB performance.

The kernel RO protection is the only valid case where this can actually
preserve large pages.

All other static protections (RO data, data NX, PCI, BIOS) are truly
static.  So a conflict with those protections which results in a split
should only ever happen when a change of memory next to a protected region
is attempted. But these conflicts are rightfully splitting the large page
to preserve the protected regions. In fact a change to the protected
regions itself is a bug and is warned about.

Add an exception for the static protection check for kernel text RO when
the to be changed region spawns a full large page which allows to preserve
the large mappings. This also prevents the syslog to be spammed about CPA
violations when ftrace is used.

The exception needs to be removed once ftrace switched over to text_poke()
which avoids the whole issue.

Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
---
 arch/x86/mm/pageattr.c |   26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -516,7 +516,7 @@ static inline void check_conflict(int wa
  */
 static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 					  unsigned long pfn, unsigned long npg,
-					  int warnlvl)
+					  unsigned long lpsize, int warnlvl)
 {
 	pgprotval_t forbidden, res;
 	unsigned long end;
@@ -535,9 +535,17 @@ static inline pgprot_t static_protection
 	check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
 	forbidden = res;
 
-	res = protect_kernel_text_ro(start, end);
-	check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
-	forbidden |= res;
+	/*
+	 * Special case to preserve a large page. If the change spawns the
+	 * full large page mapping then there is no point to split it
+	 * up. Happens with ftrace and is going to be removed once ftrace
+	 * switched to text_poke().
+	 */
+	if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
+		res = protect_kernel_text_ro(start, end);
+		check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
+		forbidden |= res;
+	}
 
 	/* Check the PFN directly */
 	res = protect_pci_bios(pfn, pfn + npg - 1);
@@ -819,7 +827,7 @@ static int __should_split_large_page(pte
 	 * extra conditional required here.
 	 */
 	chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
-				      CPA_CONFLICT);
+				      psize, CPA_CONFLICT);
 
 	if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
 		/*
@@ -855,7 +863,7 @@ static int __should_split_large_page(pte
 	 * protection requirement in the large page.
 	 */
 	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
-				      CPA_DETECT);
+				      psize, CPA_DETECT);
 
 	/*
 	 * If there is a conflict, split the large page.
@@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
 	if (!cpa->force_static_prot)
 		goto set;
 
-	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
+	/* Hand in lpsize = 0 to enforce the protection mechanism */
+	prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);
 
 	if (pgprot_val(prot) == pgprot_val(ref_prot))
 		goto set;
@@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
 		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
 
 		cpa_inc_4k_install();
-		new_prot = static_protections(new_prot, address, pfn, 1,
+		/* Hand in lpsize = 0 to enforce the protection mechanism */
+		new_prot = static_protections(new_prot, address, pfn, 1, 0,
 					      CPA_PROTECT);
 
 		new_prot = pgprot_clear_protnone_bits(new_prot);

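The effect of the new lpsize argument can be illustrated with a stand-alone
sketch of the added condition (not kernel code; 4 KB base pages and a 2 MB
large page assumed). The Text RO check is skipped only when the range handed
to static_protections() is exactly one whole, aligned large page, which is
what the __should_split_large_page() call sites pass, while lpsize = 0 keeps
it always enforced:

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE	4096UL
#define LPSIZE		(512 * PAGE_SIZE)	/* one 2 MB large page */

/*
 * Mirror of the condition added above: keep enforcing the Text RO
 * protection unless the changed range is exactly one whole, aligned
 * large page.
 */
static bool enforce_text_ro(unsigned long start, unsigned long npg,
			    unsigned long lpsize)
{
	return lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1));
}

int main(void)
{
	/* whole aligned 2 MB page (the ftrace case): check skipped */
	printf("%d\n", enforce_text_ro(0xffffffff81200000UL, 512, LPSIZE));
	/* partial change inside the large page: still enforced */
	printf("%d\n", enforce_text_ro(0xffffffff81200000UL, 16, LPSIZE));
	/* lpsize = 0 as handed in by the 4k paths: always enforced */
	printf("%d\n", enforce_text_ro(0xffffffff81200000UL, 1, 0));
	return 0;
}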

* Re: [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
  2019-08-28 22:31             ` [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text Thomas Gleixner
@ 2019-08-28 23:03               ` Song Liu
  2019-08-29 13:01               ` Peter Zijlstra
  2019-08-29 18:55               ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 26+ messages in thread
From: Song Liu @ 2019-08-28 23:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel, Steven Rostedt


> On Aug 28, 2019, at 3:31 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> ftrace does not use text_poke() for enabling trace functionality. It uses
> its own mechanism and flips the whole kernel text to RW and back to RO.
> 
> The CPA rework removed a loop based check of 4k pages which tried to
> preserve a large page by checking each 4k page whether the change would
> actually cover all pages in the large page.
> 
> This resulted in endless loops for nothing as in testing it turned out that
> it actually never preserved anything. Of course testing missed to include
> ftrace, which is the one and only case which benefitted from the 4k loop.
> 
> As a consequence enabling function tracing or ftrace based kprobes results
> in a full 4k split of the kernel text, which affects iTLB performance.
> 
> The kernel RO protection is the only valid case where this can actually
> preserve large pages.
> 
> All other static protections (RO data, data NX, PCI, BIOS) are truly
> static.  So a conflict with those protections which results in a split
> should only ever happen when a change of memory next to a protected region
> is attempted. But these conflicts are rightfully splitting the large page
> to preserve the protected regions. In fact a change to the protected
> regions itself is a bug and is warned about.
> 
> Add an exception for the static protection check for kernel text RO when
> the to be changed region spawns a full large page which allows to preserve
> the large mappings. This also prevents the syslog to be spammed about CPA
> violations when ftrace is used.
> 
> The exception needs to be removed once ftrace switched over to text_poke()
> which avoids the whole issue.
> 
> Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
> Reported-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: stable@vger.kernel.org

This looks great. Much cleaner than my workaround. 

Thanks!

Reviewed-and-tested-by: Song Liu <songliubraving@fb.com>

We need this for v4.20 to v5.3 (assuming Peter's patches will land in 5.4). 


> ---
> arch/x86/mm/pageattr.c |   26 ++++++++++++++++++--------
> 1 file changed, 18 insertions(+), 8 deletions(-)
> 
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -516,7 +516,7 @@ static inline void check_conflict(int wa
>  */
> static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
> 					  unsigned long pfn, unsigned long npg,
> -					  int warnlvl)
> +					  unsigned long lpsize, int warnlvl)
> {
> 	pgprotval_t forbidden, res;
> 	unsigned long end;
> @@ -535,9 +535,17 @@ static inline pgprot_t static_protection
> 	check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
> 	forbidden = res;
> 
> -	res = protect_kernel_text_ro(start, end);
> -	check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> -	forbidden |= res;
> +	/*
> +	 * Special case to preserve a large page. If the change spawns the
> +	 * full large page mapping then there is no point to split it
> +	 * up. Happens with ftrace and is going to be removed once ftrace
> +	 * switched to text_poke().
> +	 */
> +	if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
> +		res = protect_kernel_text_ro(start, end);
> +		check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> +		forbidden |= res;
> +	}
> 
> 	/* Check the PFN directly */
> 	res = protect_pci_bios(pfn, pfn + npg - 1);
> @@ -819,7 +827,7 @@ static int __should_split_large_page(pte
> 	 * extra conditional required here.
> 	 */
> 	chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
> -				      CPA_CONFLICT);
> +				      psize, CPA_CONFLICT);
> 
> 	if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
> 		/*
> @@ -855,7 +863,7 @@ static int __should_split_large_page(pte
> 	 * protection requirement in the large page.
> 	 */
> 	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
> -				      CPA_DETECT);
> +				      psize, CPA_DETECT);
> 
> 	/*
> 	 * If there is a conflict, split the large page.
> @@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
> 	if (!cpa->force_static_prot)
> 		goto set;
> 
> -	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
> +	/* Hand in lpsize = 0 to enforce the protection mechanism */
> +	prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);
> 
> 	if (pgprot_val(prot) == pgprot_val(ref_prot))
> 		goto set;
> @@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
> 		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
> 
> 		cpa_inc_4k_install();
> -		new_prot = static_protections(new_prot, address, pfn, 1,
> +		/* Hand in lpsize = 0 to enforce the protection mechanism */
> +		new_prot = static_protections(new_prot, address, pfn, 1, 0,
> 					      CPA_PROTECT);
> 
> 		new_prot = pgprot_clear_protnone_bits(new_prot);



* Re: [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
@ 2019-08-28 23:22       ` Ingo Molnar
  2019-08-29 19:02       ` [tip: x86/pti] " tip-bot2 for Song Liu
  2019-08-30 10:24       ` [patch V3 1/2] " Joerg Roedel
  2 siblings, 0 replies; 26+ messages in thread
From: Ingo Molnar @ 2019-08-28 23:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Joerg Roedel, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel


* Thomas Gleixner <tglx@linutronix.de> wrote:

> From: Song Liu <songliubraving@fb.com>
> 
> pti_clone_pmds() assumes that the supplied address is either:
> 
>  - properly PUD/PMD aligned
> or
>  - the address is actually mapped which means that independently
>    of the mapping level (PUD/PMD/PTE) the next higher mapping
>    exists.
> 
> If that's not the case the unaligned address can be incremented by PUD or
> PMD size incorrectly. All callers supply mapped and/or aligned addresses,
> but for the sake of robustness it's better to handle that case properly and
> to emit a warning.
> 
> [ tglx: Rewrote changelog and added WARN_ON_ONCE() ]
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> V2: Negate P[UM]D_MASK for checking whether the offset part is 0
> V3: Fix changelog
> ---
>  arch/x86/mm/pti.c |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			WARN_ON_ONCE(addr & ~PUD_MASK);
> +			addr = round_up(addr + 1, PUD_SIZE);
>  			continue;
>  		}
>  
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> -			addr += PMD_SIZE;
> +			WARN_ON_ONCE(addr & ~PMD_MASK);
> +			addr = round_up(addr + 1, PMD_SIZE);
>  			continue;
>  		}

Reviewed-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
  2019-08-28 22:31             ` [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text Thomas Gleixner
  2019-08-28 23:03               ` Song Liu
@ 2019-08-29 13:01               ` Peter Zijlstra
  2019-08-29 18:55               ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2019-08-29 13:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Song Liu, Dave Hansen, LKML, x86, Joerg Roedel, Andy Lutomirski,
	Rik van Riel, Steven Rostedt

On Thu, Aug 29, 2019 at 12:31:34AM +0200, Thomas Gleixner wrote:
>  arch/x86/mm/pageattr.c |   26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -516,7 +516,7 @@ static inline void check_conflict(int wa
>   */
>  static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
>  					  unsigned long pfn, unsigned long npg,
> -					  int warnlvl)
> +					  unsigned long lpsize, int warnlvl)
>  {
>  	pgprotval_t forbidden, res;
>  	unsigned long end;
> @@ -535,9 +535,17 @@ static inline pgprot_t static_protection
>  	check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
>  	forbidden = res;
>  
> -	res = protect_kernel_text_ro(start, end);
> -	check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> -	forbidden |= res;
> +	/*
> +	 * Special case to preserve a large page. If the change spawns the
> +	 * full large page mapping then there is no point to split it
> +	 * up. Happens with ftrace and is going to be removed once ftrace
> +	 * switched to text_poke().
> +	 */
> +	if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
> +		res = protect_kernel_text_ro(start, end);
> +		check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> +		forbidden |= res;
> +	}

Right, so this allows the RW (doesn't enforce RO) and thereby doesn't
force split, when it is a whole large page.

>  
>  	/* Check the PFN directly */
>  	res = protect_pci_bios(pfn, pfn + npg - 1);
> @@ -819,7 +827,7 @@ static int __should_split_large_page(pte
>  	 * extra conditional required here.
>  	 */
>  	chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
> -				      CPA_CONFLICT);
> +				      psize, CPA_CONFLICT);
>  
>  	if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
>  		/*
> @@ -855,7 +863,7 @@ static int __should_split_large_page(pte
>  	 * protection requirement in the large page.
>  	 */
>  	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
> -				      CPA_DETECT);
> +				      psize, CPA_DETECT);
>  
>  	/*
>  	 * If there is a conflict, split the large page.

And these are the callsites in __should_split_large_page(), and you
provide psize, and therefore we allow RW to preserve the large pages on
the kernel text.

> @@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
>  	if (!cpa->force_static_prot)
>  		goto set;
>  
> -	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
> +	/* Hand in lpsize = 0 to enforce the protection mechanism */
> +	prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);

This is when we've already decided to split, in which case we might as
well enforce the normal rules, and .lpsize=0 does just that.

>  
>  	if (pgprot_val(prot) == pgprot_val(ref_prot))
>  		goto set;
> @@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
>  		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
>  
>  		cpa_inc_4k_install();
> -		new_prot = static_protections(new_prot, address, pfn, 1,
> +		/* Hand in lpsize = 0 to enforce the protection mechanism */
> +		new_prot = static_protections(new_prot, address, pfn, 1, 0,
>  					      CPA_PROTECT);

And here we check the protections of a single 4k page, in which case
large pages are irrelevant and again .lpsize=0 disables the new code.

>  
>  		new_prot = pgprot_clear_protnone_bits(new_prot);


That all seems OK I suppose.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* [tip: x86/urgent] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
  2019-08-28 22:31             ` [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text Thomas Gleixner
  2019-08-28 23:03               ` Song Liu
  2019-08-29 13:01               ` Peter Zijlstra
@ 2019-08-29 18:55               ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2019-08-29 18:55 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Song Liu, Thomas Gleixner, Peter Zijlstra (Intel),
	stable, Ingo Molnar, Borislav Petkov, linux-kernel

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     7af0145067bc429a09ac4047b167c0971c9f0dc7
Gitweb:        https://git.kernel.org/tip/7af0145067bc429a09ac4047b167c0971c9f0dc7
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Thu, 29 Aug 2019 00:31:34 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 29 Aug 2019 20:48:44 +02:00

x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text

ftrace does not use text_poke() for enabling trace functionality. It uses
its own mechanism and flips the whole kernel text to RW and back to RO.

The CPA rework removed a loop based check of 4k pages which tried to
preserve a large page by checking each 4k page whether the change would
actually cover all pages in the large page.

This resulted in endless loops for nothing as in testing it turned out that
it actually never preserved anything. Of course testing missed to include
ftrace, which is the one and only case which benefitted from the 4k loop.

As a consequence enabling function tracing or ftrace based kprobes results
in a full 4k split of the kernel text, which affects iTLB performance.

The kernel RO protection is the only valid case where this can actually
preserve large pages.

All other static protections (RO data, data NX, PCI, BIOS) are truly
static.  So a conflict with those protections which results in a split
should only ever happen when a change of memory next to a protected region
is attempted. But these conflicts are rightfully splitting the large page
to preserve the protected regions. In fact a change to the protected
regions itself is a bug and is warned about.

Add an exception for the static protection check for kernel text RO when
the to be changed region spawns a full large page which allows to preserve
the large mappings. This also prevents the syslog to be spammed about CPA
violations when ftrace is used.

The exception needs to be removed once ftrace switched over to text_poke()
which avoids the whole issue.

Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282355340.1938@nanos.tec.linutronix.de
---
 arch/x86/mm/pageattr.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a9a77a..e14e95e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -516,7 +516,7 @@ static inline void check_conflict(int warnlvl, pgprot_t prot, pgprotval_t val,
  */
 static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 					  unsigned long pfn, unsigned long npg,
-					  int warnlvl)
+					  unsigned long lpsize, int warnlvl)
 {
 	pgprotval_t forbidden, res;
 	unsigned long end;
@@ -535,9 +535,17 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 	check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
 	forbidden = res;
 
-	res = protect_kernel_text_ro(start, end);
-	check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
-	forbidden |= res;
+	/*
+	 * Special case to preserve a large page. If the change spawns the
+	 * full large page mapping then there is no point to split it
+	 * up. Happens with ftrace and is going to be removed once ftrace
+	 * switched to text_poke().
+	 */
+	if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
+		res = protect_kernel_text_ro(start, end);
+		check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
+		forbidden |= res;
+	}
 
 	/* Check the PFN directly */
 	res = protect_pci_bios(pfn, pfn + npg - 1);
@@ -819,7 +827,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	 * extra conditional required here.
 	 */
 	chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
-				      CPA_CONFLICT);
+				      psize, CPA_CONFLICT);
 
 	if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
 		/*
@@ -855,7 +863,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	 * protection requirement in the large page.
 	 */
 	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
-				      CPA_DETECT);
+				      psize, CPA_DETECT);
 
 	/*
 	 * If there is a conflict, split the large page.
@@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
 	if (!cpa->force_static_prot)
 		goto set;
 
-	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
+	/* Hand in lpsize = 0 to enforce the protection mechanism */
+	prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);
 
 	if (pgprot_val(prot) == pgprot_val(ref_prot))
 		goto set;
@@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
 
 		cpa_inc_4k_install();
-		new_prot = static_protections(new_prot, address, pfn, 1,
+		/* Hand in lpsize = 0 to enforce the protection mechanism */
+		new_prot = static_protections(new_prot, address, pfn, 1, 0,
 					      CPA_PROTECT);
 
 		new_prot = pgprot_clear_protnone_bits(new_prot);


* [tip: x86/pti] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
  2019-08-28 23:22       ` Ingo Molnar
@ 2019-08-29 19:02       ` tip-bot2 for Song Liu
  2019-08-30 10:24       ` [patch V3 1/2] " Joerg Roedel
  2 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Song Liu @ 2019-08-29 19:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Song Liu, Thomas Gleixner, Ingo Molnar, Peter Zijlstra (Intel),
	Borislav Petkov, linux-kernel

The following commit has been merged into the x86/pti branch of tip:

Commit-ID:     825d0b73cd7526b0bb186798583fae810091cbac
Gitweb:        https://git.kernel.org/tip/825d0b73cd7526b0bb186798583fae810091cbac
Author:        Song Liu <songliubraving@fb.com>
AuthorDate:    Wed, 28 Aug 2019 23:54:55 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 29 Aug 2019 20:52:52 +02:00

x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

pti_clone_pmds() assumes that the supplied address is either:

 - properly PUD/PMD aligned
or
 - the address is actually mapped which means that independently
   of the mapping level (PUD/PMD/PTE) the next higher mapping
   exists.

If that's not the case the unaligned address can be incremented by PUD or
PMD size incorrectly. All callers supply mapped and/or aligned addresses,
but for the sake of robustness it's better to handle that case properly and
to emit a warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282352470.1938@nanos.tec.linutronix.de

---
 arch/x86/mm/pti.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524..a24487b 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			WARN_ON_ONCE(addr & ~PUD_MASK);
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			WARN_ON_ONCE(addr & ~PMD_MASK);
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}
 


* [tip: x86/pti] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
                     ` (2 preceding siblings ...)
  2019-08-28 19:00   ` Ingo Molnar
@ 2019-08-29 19:02   ` tip-bot2 for Thomas Gleixner
  2019-08-30 10:25   ` [patch 2/2] " Joerg Roedel
  4 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2019-08-29 19:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Thomas Gleixner, Dave Hansen, Ingo Molnar, Song Liu,
	Peter Zijlstra (Intel),
	Borislav Petkov, linux-kernel

The following commit has been merged into the x86/pti branch of tip:

Commit-ID:     990784b57731192b7d90c8d4049e6318d81e887d
Gitweb:        https://git.kernel.org/tip/990784b57731192b7d90c8d4049e6318d81e887d
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Wed, 28 Aug 2019 16:24:47 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 29 Aug 2019 20:52:53 +02:00

x86/mm/pti: Do not invoke PTI functions when PTI is disabled

When PTI is disabled at boot time either because the CPU is not affected or
PTI has been disabled on the command line, the boot code still calls into
pti_finalize() which then unconditionally invokes:

     pti_clone_entry_text()
     pti_clone_kernel_text()

pti_clone_kernel_text() was called unconditionally before the 32bit support
was added and 32bit added the call to pti_clone_entry_text().

The call has no side effects as cloning the page tables into the available
second one, which was allocated for PTI does not create damage. But it does
not make sense either and in case that this functionality would be extended
later this might actually lead to hard to diagnose issues.

Neither function should be called when PTI is runtime disabled. Make the
invocation conditional.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190828143124.063353972@linutronix.de

---
 arch/x86/mm/pti.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index a24487b..7f21404 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -668,6 +668,8 @@ void __init pti_init(void)
  */
 void pti_finalize(void)
 {
+	if (!boot_cpu_has(X86_FEATURE_PTI))
+		return;
 	/*
 	 * We need to clone everything (again) that maps parts of the
 	 * kernel image.


* Re: [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()
  2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
  2019-08-28 23:22       ` Ingo Molnar
  2019-08-29 19:02       ` [tip: x86/pti] " tip-bot2 for Song Liu
@ 2019-08-30 10:24       ` Joerg Roedel
  2 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2019-08-30 10:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, Aug 28, 2019 at 11:54:55PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			WARN_ON_ONCE(addr & ~PUD_MASK);
> +			addr = round_up(addr + 1, PUD_SIZE);
>  			continue;
>  		}
>  
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> -			addr += PMD_SIZE;
> +			WARN_ON_ONCE(addr & ~PMD_MASK);
> +			addr = round_up(addr + 1, PMD_SIZE);
>  			continue;
>  		}
>

Reviewed-by: Joerg Roedel <jroedel@suse.de>


* Re: [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled
  2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
                     ` (3 preceding siblings ...)
  2019-08-29 19:02   ` [tip: x86/pti] " tip-bot2 for Thomas Gleixner
@ 2019-08-30 10:25   ` Joerg Roedel
  4 siblings, 0 replies; 26+ messages in thread
From: Joerg Roedel @ 2019-08-30 10:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Song Liu, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Rik van Riel

On Wed, Aug 28, 2019 at 04:24:47PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -668,6 +668,8 @@ void __init pti_init(void)
>   */
>  void pti_finalize(void)
>  {
> +	if (!boot_cpu_has(X86_FEATURE_PTI))
> +		return;
>  	/*
>  	 * We need to clone everything (again) that maps parts of the
>  	 * kernel image.
>

Reviewed-by: Joerg Roedel <jroedel@suse.de>


Thread overview: 26+ messages
2019-08-28 14:24 [patch 0/2] x86/mm/pti: Robustness updates Thomas Gleixner
2019-08-28 14:24 ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Thomas Gleixner
2019-08-28 15:46   ` Dave Hansen
2019-08-28 15:51     ` Thomas Gleixner
2019-08-28 17:58       ` Song Liu
2019-08-28 20:05         ` Thomas Gleixner
2019-08-28 20:32           ` Song Liu
2019-08-28 22:31             ` [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text Thomas Gleixner
2019-08-28 23:03               ` Song Liu
2019-08-29 13:01               ` Peter Zijlstra
2019-08-29 18:55               ` [tip: x86/urgent] " tip-bot2 for Thomas Gleixner
2019-08-28 18:58   ` [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable() Ingo Molnar
2019-08-28 19:45     ` Thomas Gleixner
2019-08-28 20:54   ` [patch V2 " Thomas Gleixner
2019-08-28 21:52     ` Thomas Gleixner
2019-08-28 21:54     ` [patch V3 " Thomas Gleixner
2019-08-28 23:22       ` Ingo Molnar
2019-08-29 19:02       ` [tip: x86/pti] " tip-bot2 for Song Liu
2019-08-30 10:24       ` [patch V3 1/2] " Joerg Roedel
2019-08-28 14:24 ` [patch 2/2] x86/mm/pti: Do not invoke PTI functions when PTI is disabled Thomas Gleixner
2019-08-28 15:47   ` Dave Hansen
2019-08-28 17:49   ` Song Liu
2019-08-28 19:00   ` Ingo Molnar
2019-08-29 19:02   ` [tip: x86/pti] " tip-bot2 for Thomas Gleixner
2019-08-30 10:25   ` [patch 2/2] " Joerg Roedel
2019-08-28 16:03 ` [patch 0/2] x86/mm/pti: Robustness updates Peter Zijlstra
