* [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
@ 2019-08-20 7:51 Song Liu
2019-08-20 9:12 ` Thomas Gleixner
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Song Liu @ 2019-08-20 7:51 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: kernel-team, Song Liu, stable, Joerg Roedel, Thomas Gleixner,
Dave Hansen, Andy Lutomirski, Peter Zijlstra
pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
This is not accurate because addr may not be PUD_SIZE aligned.
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
of this issuse, including PMD for the irq entry table. For a memcache
like workload, this introduces about 4.5x more iTLB-load and about 2.5x
more iTLB-load-misses on a Skylake CPU.
This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
case.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/mm/pti.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr += PMD_SIZE;
continue;
}
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 7:51 [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE Song Liu
@ 2019-08-20 9:12 ` Thomas Gleixner
2019-08-20 13:17 ` Song Liu
2019-08-20 10:00 ` Peter Zijlstra
2019-08-20 13:57 ` Dave Hansen
2 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2019-08-20 9:12 UTC (permalink / raw)
To: Song Liu
Cc: linux-kernel, linux-mm, kernel-team, stable, Joerg Roedel,
Dave Hansen, Andy Lutomirski, Peter Zijlstra
On Tue, 20 Aug 2019, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
You fail to explain how this happened. The code before the 32bit support
did always increase by PMD_SIZE. The 32bit support broke that.
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
This information is largely irrelevant. What matters is the fact that this
got broken and incorrectly forwards the address by PUD_SIZE which is wrong
if address is not PUD_SIZE aligned.
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.
git grep 'This patch' Documentation/process/submitting-patches.rst
> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
> arch/x86/mm/pti.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
The right fix is to skip forward to the next PUD boundary instead of doing
this in a loop with PMD_SIZE increments.
Thanks,
tglx
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 7:51 [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE Song Liu
2019-08-20 9:12 ` Thomas Gleixner
@ 2019-08-20 10:00 ` Peter Zijlstra
2019-08-20 11:16 ` Thomas Gleixner
2019-08-20 13:19 ` Song Liu
2019-08-20 13:57 ` Dave Hansen
2 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2019-08-20 10:00 UTC (permalink / raw)
To: Song Liu
Cc: linux-kernel, linux-mm, kernel-team, stable, Joerg Roedel,
Thomas Gleixner, Dave Hansen, Andy Lutomirski
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
>
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
>
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }
I'm thinking you're right in that there's a bug here, but I'm also
thinking your patch is both incomplete and broken.
What that code wants to do is skip to the end of the pud, a pmd_size
increase will not do that. And right below this, there's a second
instance of this exact pattern.
Did I get the below right?
---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
+ addr &= PUD_MASK;
addr += PUD_SIZE;
continue;
}
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
+ addr &= PMD_MASK;
addr += PMD_SIZE;
continue;
}
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 10:00 ` Peter Zijlstra
@ 2019-08-20 11:16 ` Thomas Gleixner
2019-08-20 13:21 ` Song Liu
2019-08-20 13:21 ` [PATCH] " Rik van Riel
2019-08-20 13:19 ` Song Liu
1 sibling, 2 replies; 17+ messages in thread
From: Thomas Gleixner @ 2019-08-20 11:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Song Liu, linux-kernel, linux-mm, kernel-team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;
round_up(addr, PUD_SIZE);
perhaps?
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 9:12 ` Thomas Gleixner
@ 2019-08-20 13:17 ` Song Liu
0 siblings, 0 replies; 17+ messages in thread
From: Song Liu @ 2019-08-20 13:17 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Linux Kernel Mailing List, Linux MM, Kernel Team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski, Peter Zijlstra
> On Aug 20, 2019, at 2:12 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 20 Aug 2019, Song Liu wrote:
>
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>
> You fail to explain how this happened. The code before the 32bit support
> did always increase by PMD_SIZE. The 32bit support broke that.
Will fix.
>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> This information is largely irrelevant. What matters is the fact that this
> got broken and incorrectly forwards the address by PUD_SIZE which is wrong
> if address is not PUD_SIZE aligned.
We started looking into this because we cannot explain the regression in
iTLB miss rate. I guess the patch itself explains it pretty well, so the
original issue doesn't matter that much?
I will remove this part.
>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
> git grep 'This patch' Documentation/process/submitting-patches.rst
Will fix.
>> Cc: stable@vger.kernel.org # v4.19+
>> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Cc: Joerg Roedel <jroedel@suse.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> ---
>> arch/x86/mm/pti.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>
> The right fix is to skip forward to the next PUD boundary instead of doing
> this in a loop with PMD_SIZE increments.
Agreed.
Thanks,
Song
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 10:00 ` Peter Zijlstra
2019-08-20 11:16 ` Thomas Gleixner
@ 2019-08-20 13:19 ` Song Liu
1 sibling, 0 replies; 17+ messages in thread
From: Song Liu @ 2019-08-20 13:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linux Kernel Mailing List, linux-mm, Kernel Team, stable,
Joerg Roedel, Thomas Gleixner, Dave Hansen, Andy Lutomirski
> On Aug 20, 2019, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>> continue;
>> }
>
> I'm thinking you're right in that there's a bug here, but I'm also
> thinking your patch is both incomplete and broken.
>
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
Yes, your are right.
Thanks,
Song
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 11:16 ` Thomas Gleixner
@ 2019-08-20 13:21 ` Song Liu
2019-08-20 13:39 ` Thomas Gleixner
2019-08-20 13:55 ` Rik van Riel
2019-08-20 13:21 ` [PATCH] " Rik van Riel
1 sibling, 2 replies; 17+ messages in thread
From: Song Liu @ 2019-08-20 13:21 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Zijlstra, linux-kernel, linux-mm, Kernel Team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski
> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>> What that code wants to do is skip to the end of the pud, a pmd_size
>> increase will not do that. And right below this, there's a second
>> instance of this exact pattern.
>>
>> Did I get the below right?
>>
>> ---
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..32b20b3cb227 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> + addr &= PUD_MASK;
>> addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
Thanks,
Song
>
> perhaps?
>
>> continue;
>> }
>>
>> pmd = pmd_offset(pud, addr);
>> if (pmd_none(*pmd)) {
>> + addr &= PMD_MASK;
>> addr += PMD_SIZE;
>> continue;
>> }
>>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 11:16 ` Thomas Gleixner
2019-08-20 13:21 ` Song Liu
@ 2019-08-20 13:21 ` Rik van Riel
1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2019-08-20 13:21 UTC (permalink / raw)
To: Thomas Gleixner, Peter Zijlstra
Cc: Song Liu, linux-kernel, linux-mm, Kernel Team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski
On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > What that code wants to do is skip to the end of the pud, a
> > pmd_size
> > increase will not do that. And right below this, there's a second
> > instance of this exact pattern.
> >
> > Did I get the below right?
> >
> > ---
> > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > index b196524759ec..32b20b3cb227 100644
> > --- a/arch/x86/mm/pti.c
> > +++ b/arch/x86/mm/pti.c
> > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > unsigned long end,
> >
> > pud = pud_offset(p4d, addr);
> > if (pud_none(*pud)) {
> > + addr &= PUD_MASK;
> > addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
>
> perhaps?
Won't that keep returning the same address forever
if addr & PUD_MASK == 0?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 13:21 ` Song Liu
@ 2019-08-20 13:39 ` Thomas Gleixner
2019-08-20 13:55 ` Rik van Riel
1 sibling, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2019-08-20 13:39 UTC (permalink / raw)
To: Song Liu
Cc: Peter Zijlstra, linux-kernel, linux-mm, Kernel Team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski
On Tue, 20 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> >> What that code wants to do is skip to the end of the pud, a pmd_size
> >> increase will not do that. And right below this, there's a second
> >> instance of this exact pattern.
> >>
> >> Did I get the below right?
> >>
> >> ---
> >> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> >> index b196524759ec..32b20b3cb227 100644
> >> --- a/arch/x86/mm/pti.c
> >> +++ b/arch/x86/mm/pti.c
> >> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >>
> >> pud = pud_offset(p4d, addr);
> >> if (pud_none(*pud)) {
> >> + addr &= PUD_MASK;
> >> addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
Right you are.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 13:21 ` Song Liu
2019-08-20 13:39 ` Thomas Gleixner
@ 2019-08-20 13:55 ` Rik van Riel
2019-08-20 14:00 ` Song Liu
1 sibling, 1 reply; 17+ messages in thread
From: Rik van Riel @ 2019-08-20 13:55 UTC (permalink / raw)
To: Song Liu, Thomas Gleixner
Cc: Peter Zijlstra, linux-kernel, linux-mm, Kernel Team, stable,
Joerg Roedel, Dave Hansen, Andy Lutomirski
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
> > wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > > What that code wants to do is skip to the end of the pud, a
> > > pmd_size
> > > increase will not do that. And right below this, there's a second
> > > instance of this exact pattern.
> > >
> > > Did I get the below right?
> > >
> > > ---
> > > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > > index b196524759ec..32b20b3cb227 100644
> > > --- a/arch/x86/mm/pti.c
> > > +++ b/arch/x86/mm/pti.c
> > > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > > unsigned long end,
> > >
> > > pud = pud_offset(p4d, addr);
> > > if (pud_none(*pud)) {
> > > + addr &= PUD_MASK;
> > > addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
What does that do if start is less than PMD_SIZE
away from the next PUD_SIZE boundary?
How about: round_up(addr + 1, PUD_SIZE) ?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 7:51 [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE Song Liu
2019-08-20 9:12 ` Thomas Gleixner
2019-08-20 10:00 ` Peter Zijlstra
@ 2019-08-20 13:57 ` Dave Hansen
2019-08-20 14:14 ` Song Liu
2 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2019-08-20 13:57 UTC (permalink / raw)
To: Song Liu, linux-kernel, linux-mm
Cc: kernel-team, stable, Joerg Roedel, Thomas Gleixner, Dave Hansen,
Andy Lutomirski, Peter Zijlstra
On 8/20/19 12:51 AM, Song Liu wrote:
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
I was surprised that this manifests as a performance issue. Usually
messing up PTI page table manipulation means you get to experience the
jobs of debugging triple faults. But, it makes sense if its this line:
/*
* Note that this will undo _some_ of the work that
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
which is restoring the Global bit.
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
and shouldn't have a global kernel image. Could you confirm whether
PCIDs are supported on this CPU?
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }
Did we also bugger up this code:
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
addr += PMD_SIZE;
continue;
}
if we're on 32-bit and this:
#define PTI_LEVEL_KERNEL_IMAGE PTI_CLONE_PTE
and we get a hole walking to a non-PMD-aligned address?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 13:55 ` Rik van Riel
@ 2019-08-20 14:00 ` Song Liu
2019-08-20 16:56 ` [PATCH v2] " Rik van Riel
0 siblings, 1 reply; 17+ messages in thread
From: Song Liu @ 2019-08-20 14:00 UTC (permalink / raw)
To: Rik van Riel
Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-mm,
Kernel Team, stable, Joerg Roedel, Dave Hansen, Andy Lutomirski
> On Aug 20, 2019, at 6:55 AM, Rik van Riel <riel@fb.com> wrote:
>
> On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
>>> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
>>> wrote:
>>>
>>> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>>>> What that code wants to do is skip to the end of the pud, a
>>>> pmd_size
>>>> increase will not do that. And right below this, there's a second
>>>> instance of this exact pattern.
>>>>
>>>> Did I get the below right?
>>>>
>>>> ---
>>>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>>>> index b196524759ec..32b20b3cb227 100644
>>>> --- a/arch/x86/mm/pti.c
>>>> +++ b/arch/x86/mm/pti.c
>>>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
>>>> unsigned long end,
>>>>
>>>> pud = pud_offset(p4d, addr);
>>>> if (pud_none(*pud)) {
>>>> + addr &= PUD_MASK;
>>>> addr += PUD_SIZE;
>>>
>>> round_up(addr, PUD_SIZE);
>>
>> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
>
> What does that do if start is less than PMD_SIZE
> away from the next PUD_SIZE boundary?
Great point!
>
> How about: round_up(addr + 1, PUD_SIZE) ?
Yes. How about this?
=========================== 8< ============================
From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly
Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
This behavior changes after the 32-bit support: pti_clone_pgtable()
increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
addr may not be PUD_SIZE/PMD_SIZE aligned.
Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
in these two cases.
Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
arch/x86/mm/pti.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}
--
2.17.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 13:57 ` Dave Hansen
@ 2019-08-20 14:14 ` Song Liu
2019-08-20 14:18 ` Dave Hansen
0 siblings, 1 reply; 17+ messages in thread
From: Song Liu @ 2019-08-20 14:14 UTC (permalink / raw)
To: Dave Hansen
Cc: Linux Kernel Mailing List, Linux MM, Kernel Team, stable,
Joerg Roedel, Thomas Gleixner, Dave Hansen, Andy Lutomirski,
Peter Zijlstra
> On Aug 20, 2019, at 6:57 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/20/19 12:51 AM, Song Liu wrote:
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> I was surprised that this manifests as a performance issue. Usually
> messing up PTI page table manipulation means you get to experience the
> jobs of debugging triple faults. But, it makes sense if its this line:
>
> /*
> * Note that this will undo _some_ of the work that
> * pti_set_kernel_image_nonglobal() did to clear the
> * global bit.
> */
> pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
>
> which is restoring the Global bit.
>
> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
> and shouldn't have a global kernel image. Could you confirm whether
> PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
Thanks,
Song
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 14:14 ` Song Liu
@ 2019-08-20 14:18 ` Dave Hansen
2019-08-20 16:05 ` Song Liu
0 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2019-08-20 14:18 UTC (permalink / raw)
To: Song Liu
Cc: Linux Kernel Mailing List, Linux MM, Kernel Team, stable,
Joerg Roedel, Thomas Gleixner, Dave Hansen, Andy Lutomirski,
Peter Zijlstra
On 8/20/19 7:14 AM, Song Liu wrote:
>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>> and shouldn't have a global kernel image. Could you confirm whether
>> PCIDs are supported on this CPU?
> Yes, pcid is listed in /proc/cpuinfo.
So what's going on? Could you confirm exactly which pti_clone_pgtable()
is causing you problems? Do you have a theory as to why this manifests
as a performance problem rather than a functional one?
A diff of these:
/sys/kernel/debug/page_tables/current_user
/sys/kernel/debug/page_tables/current_kernel
before and after your patch might be helpful.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 14:18 ` Dave Hansen
@ 2019-08-20 16:05 ` Song Liu
2019-08-20 16:38 ` Song Liu
0 siblings, 1 reply; 17+ messages in thread
From: Song Liu @ 2019-08-20 16:05 UTC (permalink / raw)
To: Dave Hansen
Cc: Linux Kernel Mailing List, Linux MM, Kernel Team, stable,
Joerg Roedel, Thomas Gleixner, Dave Hansen, Andy Lutomirski,
Peter Zijlstra
> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/20/19 7:14 AM, Song Liu wrote:
>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>> and shouldn't have a global kernel image. Could you confirm whether
>>> PCIDs are supported on this CPU?
>> Yes, pcid is listed in /proc/cpuinfo.
>
> So what's going on? Could you confirm exactly which pti_clone_pgtable()
> is causing you problems? Do you have a theory as to why this manifests
> as a performance problem rather than a functional one?
>
> A diff of these:
>
> /sys/kernel/debug/page_tables/current_user
> /sys/kernel/debug/page_tables/current_kernel
>
> before and after your patch might be helpful.
I believe the difference is from the following entries (7 PMDs)
Before the patch:
current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
After the patch:
current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel and kernel show same data though.
Thanks,
Song
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 16:05 ` Song Liu
@ 2019-08-20 16:38 ` Song Liu
0 siblings, 0 replies; 17+ messages in thread
From: Song Liu @ 2019-08-20 16:38 UTC (permalink / raw)
To: Dave Hansen
Cc: Linux Kernel Mailing List, Linux MM, Kernel Team, stable,
Joerg Roedel, Thomas Gleixner, Dave Hansen, Andy Lutomirski,
Peter Zijlstra
> On Aug 20, 2019, at 9:05 AM, Song Liu <songliubraving@fb.com> wrote:
>
>
>
>> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 8/20/19 7:14 AM, Song Liu wrote:
>>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>>> and shouldn't have a global kernel image. Could you confirm whether
>>>> PCIDs are supported on this CPU?
>>> Yes, pcid is listed in /proc/cpuinfo.
>>
>> So what's going on? Could you confirm exactly which pti_clone_pgtable()
>> is causing you problems? Do you have a theory as to why this manifests
>> as a performance problem rather than a functional one?
>>
>> A diff of these:
>>
>> /sys/kernel/debug/page_tables/current_user
>> /sys/kernel/debug/page_tables/current_kernel
>>
>> before and after your patch might be helpful.
>
> I believe the difference is from the following entries (7 PMDs)
>
> Before the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
>
>
> After the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
>
> current_kernel and kernel show same data though.
A little more details on how I got here.
We use huge page for hot text and thus reduces iTLB misses. As we
benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more
iTLB misses.
To figure out the issue, I use a debug patch that dumps page table for
a pid. The following are information from the workload pid.
For the 4.16 based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
For the 5.2 based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the
irq entry table; while 4.16 kernel doesn't have it.
For the 5.2 based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
in 4.16 kernel. This further reduces iTLB miss rate
Thanks,
Song
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
2019-08-20 14:00 ` Song Liu
@ 2019-08-20 16:56 ` Rik van Riel
0 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2019-08-20 16:56 UTC (permalink / raw)
To: Song Liu
Cc: Thomas Gleixner, Peter Zijlstra, linux-kernel, linux-mm,
Kernel Team, stable, Joerg Roedel, Dave Hansen, Andy Lutomirski
[-- Attachment #1: Type: text/plain, Size: 1267 bytes --]
On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
>
> From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00
> 2001
> From: Song Liu <songliubraving@fb.com>
> Date: Mon, 19 Aug 2019 23:59:47 -0700
> Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr
> properly
>
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support: pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases
> addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate
> because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
>
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.
>
> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for
> 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
This looks like it should do the trick!
Reviewed-by: Rik van Riel <riel@surriel.com>
--
All Rights Reversed.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2019-08-20 16:56 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-20 7:51 [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE Song Liu
2019-08-20 9:12 ` Thomas Gleixner
2019-08-20 13:17 ` Song Liu
2019-08-20 10:00 ` Peter Zijlstra
2019-08-20 11:16 ` Thomas Gleixner
2019-08-20 13:21 ` Song Liu
2019-08-20 13:39 ` Thomas Gleixner
2019-08-20 13:55 ` Rik van Riel
2019-08-20 14:00 ` Song Liu
2019-08-20 16:56 ` [PATCH v2] " Rik van Riel
2019-08-20 13:21 ` [PATCH] " Rik van Riel
2019-08-20 13:19 ` Song Liu
2019-08-20 13:57 ` Dave Hansen
2019-08-20 14:14 ` Song Liu
2019-08-20 14:18 ` Dave Hansen
2019-08-20 16:05 ` Song Liu
2019-08-20 16:38 ` Song Liu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).