x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
diff mbox series

Message ID 20190820075128.2912224-1-songliubraving@fb.com
State New
Headers show
Series
  • x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE
Related show

Commit Message

Song Liu Aug. 20, 2019, 7:51 a.m. UTC
pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
of this issuse, including PMD for the irq entry table. For a memcache
like workload, this introduces about 4.5x more iTLB-load and about 2.5x
more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
case.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Thomas Gleixner Aug. 20, 2019, 9:12 a.m. UTC | #1
On Tue, 20 Aug 2019, Song Liu wrote:

> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.

You fail to explain how this happened. The code before the 32bit support
did always increase by PMD_SIZE. The 32bit support broke that.
 
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

This information is largely irrelevant. What matters is the fact that this
got broken and incorrectly forwards the address by PUD_SIZE which is wrong
if address is not PUD_SIZE aligned.

> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

  git grep 'This patch' Documentation/process/submitting-patches.rst

> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/x86/mm/pti.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;

The right fix is to skip forward to the next PUD boundary instead of doing
this in a loop with PMD_SIZE increments.

Thanks,

	tglx
Peter Zijlstra Aug. 20, 2019, 10 a.m. UTC | #2
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
> 
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
> 
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;
>  			continue;
>  		}

I'm thinking you're right in that there's a bug here, but I'm also
thinking your patch is both incomplete and broken.

What that code wants to do is skip to the end of the pud, a pmd_size
increase will not do that. And right below this, there's a second
instance of this exact pattern.

Did I get the below right?

---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
+			addr &= PUD_MASK;
 			addr += PUD_SIZE;
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
+			addr &= PMD_MASK;
 			addr += PMD_SIZE;
 			continue;
 		}
Thomas Gleixner Aug. 20, 2019, 11:16 a.m. UTC | #3
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
> 
> Did I get the below right?
> 
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> +			addr &= PUD_MASK;
>  			addr += PUD_SIZE;

			round_up(addr, PUD_SIZE);

perhaps?

>  			continue;
>  		}
>  
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> +			addr &= PMD_MASK;
>  			addr += PMD_SIZE;
>  			continue;
>  		}
>
Song Liu Aug. 20, 2019, 1:17 p.m. UTC | #4
> On Aug 20, 2019, at 2:12 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, 20 Aug 2019, Song Liu wrote:
> 
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
> 
> You fail to explain how this happened. The code before the 32bit support
> did always increase by PMD_SIZE. The 32bit support broke that.

Will fix. 

> 
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
> 
> This information is largely irrelevant. What matters is the fact that this
> got broken and incorrectly forwards the address by PUD_SIZE which is wrong
> if address is not PUD_SIZE aligned.

We started looking into this because we cannot explain the regression in 
iTLB miss rate. I guess the patch itself explains it pretty well, so the 
original issue doesn't matter that much?

I will remove this part. 

> 
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
> 
>  git grep 'This patch' Documentation/process/submitting-patches.rst

Will fix. 

>> Cc: stable@vger.kernel.org # v4.19+
>> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> Cc: Joerg Roedel <jroedel@suse.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> ---
>> arch/x86/mm/pti.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>> 
>> 		pud = pud_offset(p4d, addr);
>> 		if (pud_none(*pud)) {
>> -			addr += PUD_SIZE;
>> +			addr += PMD_SIZE;
> 
> The right fix is to skip forward to the next PUD boundary instead of doing
> this in a loop with PMD_SIZE increments.

Agreed. 

Thanks,
Song
Song Liu Aug. 20, 2019, 1:19 p.m. UTC | #5
> On Aug 20, 2019, at 3:00 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>> 
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>> 
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
> 
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>> 
>> 		pud = pud_offset(p4d, addr);
>> 		if (pud_none(*pud)) {
>> -			addr += PUD_SIZE;
>> +			addr += PMD_SIZE;
>> 			continue;
>> 		}
> 
> I'm thinking you're right in that there's a bug here, but I'm also
> thinking your patch is both incomplete and broken.
> 
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
> 
> Did I get the below right?

Yes, your are right. 

Thanks,
Song

> 
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> 
> 		pud = pud_offset(p4d, addr);
> 		if (pud_none(*pud)) {
> +			addr &= PUD_MASK;
> 			addr += PUD_SIZE;
> 			continue;
> 		}
> 
> 		pmd = pmd_offset(pud, addr);
> 		if (pmd_none(*pmd)) {
> +			addr &= PMD_MASK;
> 			addr += PMD_SIZE;
> 			continue;
> 		}
Song Liu Aug. 20, 2019, 1:21 p.m. UTC | #6
> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>> What that code wants to do is skip to the end of the pud, a pmd_size
>> increase will not do that. And right below this, there's a second
>> instance of this exact pattern.
>> 
>> Did I get the below right?
>> 
>> ---
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..32b20b3cb227 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>> 
>> 		pud = pud_offset(p4d, addr);
>> 		if (pud_none(*pud)) {
>> +			addr &= PUD_MASK;
>> 			addr += PUD_SIZE;
> 
> 			round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)". 

Thanks,
Song

> 
> perhaps?
> 
>> 			continue;
>> 		}
>> 
>> 		pmd = pmd_offset(pud, addr);
>> 		if (pmd_none(*pmd)) {
>> +			addr &= PMD_MASK;
>> 			addr += PMD_SIZE;
>> 			continue;
>> 		}
>>
Rik van Riel Aug. 20, 2019, 1:21 p.m. UTC | #7
On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > What that code wants to do is skip to the end of the pud, a
> > pmd_size
> > increase will not do that. And right below this, there's a second
> > instance of this exact pattern.
> > 
> > Did I get the below right?
> > 
> > ---
> > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > index b196524759ec..32b20b3cb227 100644
> > --- a/arch/x86/mm/pti.c
> > +++ b/arch/x86/mm/pti.c
> > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > unsigned long end,
> >  
> >  		pud = pud_offset(p4d, addr);
> >  		if (pud_none(*pud)) {
> > +			addr &= PUD_MASK;
> >  			addr += PUD_SIZE;
> 
> 			round_up(addr, PUD_SIZE);
> 
> perhaps?

Won't that keep returning the same address forever
if addr & PUD_MASK == 0?
Thomas Gleixner Aug. 20, 2019, 1:39 p.m. UTC | #8
On Tue, 20 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> >> What that code wants to do is skip to the end of the pud, a pmd_size
> >> increase will not do that. And right below this, there's a second
> >> instance of this exact pattern.
> >> 
> >> Did I get the below right?
> >> 
> >> ---
> >> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> >> index b196524759ec..32b20b3cb227 100644
> >> --- a/arch/x86/mm/pti.c
> >> +++ b/arch/x86/mm/pti.c
> >> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >> 
> >> 		pud = pud_offset(p4d, addr);
> >> 		if (pud_none(*pud)) {
> >> +			addr &= PUD_MASK;
> >> 			addr += PUD_SIZE;
> > 
> > 			round_up(addr, PUD_SIZE);
> 
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)". 

Right you are.
Rik van Riel Aug. 20, 2019, 1:55 p.m. UTC | #9
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
> > wrote:
> > 
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > > What that code wants to do is skip to the end of the pud, a
> > > pmd_size
> > > increase will not do that. And right below this, there's a second
> > > instance of this exact pattern.
> > > 
> > > Did I get the below right?
> > > 
> > > ---
> > > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > > index b196524759ec..32b20b3cb227 100644
> > > --- a/arch/x86/mm/pti.c
> > > +++ b/arch/x86/mm/pti.c
> > > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > > unsigned long end,
> > > 
> > > 		pud = pud_offset(p4d, addr);
> > > 		if (pud_none(*pud)) {
> > > +			addr &= PUD_MASK;
> > > 			addr += PUD_SIZE;
> > 
> > 			round_up(addr, PUD_SIZE);
> 
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)". 

What does that do if start is less than PMD_SIZE
away from the next PUD_SIZE boundary?

How about:   round_up(addr + 1, PUD_SIZE)  ?
Dave Hansen Aug. 20, 2019, 1:57 p.m. UTC | #10
On 8/20/19 12:51 AM, Song Liu wrote:
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

I was surprised that this manifests as a performance issue.  Usually
messing up PTI page table manipulation means you get to experience the
jobs of debugging triple faults.  But, it makes sense if its this line:

        /*
         * Note that this will undo _some_ of the work that
         * pti_set_kernel_image_nonglobal() did to clear the
         * global bit.
         */
        pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);

which is restoring the Global bit.

*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
and shouldn't have a global kernel image.  Could you confirm whether
PCIDs are supported on this CPU?

>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr += PMD_SIZE;
>  			continue;
>  		}

Did we also bugger up this code:

                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd)) {
                        addr += PMD_SIZE;
                        continue;
                }

if we're on 32-bit and this:

#define PTI_LEVEL_KERNEL_IMAGE  PTI_CLONE_PTE

and we get a hole walking to a non-PMD-aligned address?
Song Liu Aug. 20, 2019, 2 p.m. UTC | #11
> On Aug 20, 2019, at 6:55 AM, Rik van Riel <riel@fb.com> wrote:
> 
> On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
>>> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <tglx@linutronix.de>
>>> wrote:
>>> 
>>> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>>>> What that code wants to do is skip to the end of the pud, a
>>>> pmd_size
>>>> increase will not do that. And right below this, there's a second
>>>> instance of this exact pattern.
>>>> 
>>>> Did I get the below right?
>>>> 
>>>> ---
>>>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>>>> index b196524759ec..32b20b3cb227 100644
>>>> --- a/arch/x86/mm/pti.c
>>>> +++ b/arch/x86/mm/pti.c
>>>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
>>>> unsigned long end,
>>>> 
>>>> 		pud = pud_offset(p4d, addr);
>>>> 		if (pud_none(*pud)) {
>>>> +			addr &= PUD_MASK;
>>>> 			addr += PUD_SIZE;
>>> 
>>> 			round_up(addr, PUD_SIZE);
>> 
>> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)". 
> 
> What does that do if start is less than PMD_SIZE
> away from the next PUD_SIZE boundary?

Great point!

> 
> How about:   round_up(addr + 1, PUD_SIZE)  ?

Yes. How about this?

=========================== 8< ============================

From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
This behavior changes after the 32-bit support:  pti_clone_pgtable()
increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
addr may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
in these two cases.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

                pud = pud_offset(p4d, addr);
                if (pud_none(*pud)) {
-                       addr += PUD_SIZE;
+                       addr = round_up(addr + 1, PUD_SIZE);
                        continue;
                }

                pmd = pmd_offset(pud, addr);
                if (pmd_none(*pmd)) {
-                       addr += PMD_SIZE;
+                       addr = round_up(addr + 1, PMD_SIZE);
                        continue;
                }

--
2.17.1
Song Liu Aug. 20, 2019, 2:14 p.m. UTC | #12
> On Aug 20, 2019, at 6:57 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 8/20/19 12:51 AM, Song Liu wrote:
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
> 
> I was surprised that this manifests as a performance issue.  Usually
> messing up PTI page table manipulation means you get to experience the
> jobs of debugging triple faults.  But, it makes sense if its this line:
> 
>        /*
>         * Note that this will undo _some_ of the work that
>         * pti_set_kernel_image_nonglobal() did to clear the
>         * global bit.
>         */
>        pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
> 
> which is restoring the Global bit.
> 
> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
> and shouldn't have a global kernel image.  Could you confirm whether
> PCIDs are supported on this CPU?

Yes, pcid is listed in /proc/cpuinfo. 

Thanks,
Song
Dave Hansen Aug. 20, 2019, 2:18 p.m. UTC | #13
On 8/20/19 7:14 AM, Song Liu wrote:
>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>> and shouldn't have a global kernel image.  Could you confirm whether
>> PCIDs are supported on this CPU?
> Yes, pcid is listed in /proc/cpuinfo. 

So what's going on?  Could you confirm exactly which pti_clone_pgtable()
is causing you problems?  Do you have a theory as to why this manifests
as a performance problem rather than a functional one?

A diff of these:

	/sys/kernel/debug/page_tables/current_user
	/sys/kernel/debug/page_tables/current_kernel

before and after your patch might be helpful.
Song Liu Aug. 20, 2019, 4:05 p.m. UTC | #14
> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 8/20/19 7:14 AM, Song Liu wrote:
>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>> and shouldn't have a global kernel image.  Could you confirm whether
>>> PCIDs are supported on this CPU?
>> Yes, pcid is listed in /proc/cpuinfo. 
> 
> So what's going on?  Could you confirm exactly which pti_clone_pgtable()
> is causing you problems?  Do you have a theory as to why this manifests
> as a performance problem rather than a functional one?
> 
> A diff of these:
> 
> 	/sys/kernel/debug/page_tables/current_user
> 	/sys/kernel/debug/page_tables/current_kernel
> 
> before and after your patch might be helpful.

I believe the difference is from the following entries (7 PMDs)

Before the patch:

current_kernel:	0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte
efi:		0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte
kernel:		0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte


After the patch:

current_kernel:	0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
efi:		0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
kernel:		0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd

current_kernel and kernel show same data though. 

Thanks,
Song
Song Liu Aug. 20, 2019, 4:38 p.m. UTC | #15
> On Aug 20, 2019, at 9:05 AM, Song Liu <songliubraving@fb.com> wrote:
> 
> 
> 
>> On Aug 20, 2019, at 7:18 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> 
>> On 8/20/19 7:14 AM, Song Liu wrote:
>>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>>> and shouldn't have a global kernel image.  Could you confirm whether
>>>> PCIDs are supported on this CPU?
>>> Yes, pcid is listed in /proc/cpuinfo. 
>> 
>> So what's going on?  Could you confirm exactly which pti_clone_pgtable()
>> is causing you problems?  Do you have a theory as to why this manifests
>> as a performance problem rather than a functional one?
>> 
>> A diff of these:
>> 
>> 	/sys/kernel/debug/page_tables/current_user
>> 	/sys/kernel/debug/page_tables/current_kernel
>> 
>> before and after your patch might be helpful.
> 
> I believe the difference is from the following entries (7 PMDs)
> 
> Before the patch:
> 
> current_kernel:	0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte
> efi:		0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte
> kernel:		0xffffffff81000000-0xffffffff81e04000       14352K     ro                 GLB x  pte
> 
> 
> After the patch:
> 
> current_kernel:	0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> efi:		0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> kernel:		0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> 
> current_kernel and kernel show same data though. 

A little more details on how I got here.

We use huge page for hot text and thus reduces iTLB misses. As we 
benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more 
iTLB misses. 

To figure out the issue, I use a debug patch that dumps page table for 
a pid. The following are information from the workload pid. 


For the 4.16 based kernel:

host-4.16 # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
0xffffffff81a00000-0xffffffff81c00000           2M     ro         PSE         x  pmd


For the 5.2 based kernel before this patch:

host-5.2-before # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd


The 8MB text in pmd is from user space. 4.16 kernel has 1 pmd for the
irq entry table; while 4.16 kernel doesn't have it. 


For the 5.2 based kernel after this patch:

host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd


So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD 
in 4.16 kernel. This further reduces iTLB miss rate 

Thanks,
Song
Rik van Riel Aug. 20, 2019, 4:56 p.m. UTC | #16
On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
> 
> From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00
> 2001
> From: Song Liu <songliubraving@fb.com>
> Date: Mon, 19 Aug 2019 23:59:47 -0700
> Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr
> properly
> 
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support:  pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases
> addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate
> because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
> 
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.
> 
> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for
> 32 bit")
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>

This looks like it should do the trick!

Reviewed-by: Rik van Riel <riel@surriel.com>

Patch
diff mbox series

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@  pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr += PMD_SIZE;
 			continue;
 		}