linux-parisc.vger.kernel.org archive mirror
* Page tables on machines with >2GB RAM
@ 2020-09-29 15:33 Matthew Wilcox
  2020-09-29 17:01 ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-09-29 15:33 UTC (permalink / raw)
  To: linux-parisc


I think we can end up truncating a PMD or PGD entry (I get confused
easily about levels of the page tables; bear with me)

/* NOTE: even on 64 bits, these entries are __u32 because we allocate
 * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
typedef struct { __u32 pgd; } pgd_t;
...
typedef struct { __u32 pmd; } pmd_t;

...

        pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
                                               PGD_ALLOC_ORDER);
...
        return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);

so if we have more than 2GB of RAM, we can allocate a page with the top
bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
and mask it off, causing us to load the wrong page for the next level
of the page table walk.

Have I missed something?

Oh and I think this bug was introduced in 2004 ...

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Page tables on machines with >2GB RAM
  2020-09-29 15:33 Page tables on machines with >2GB RAM Matthew Wilcox
@ 2020-09-29 17:01 ` Matthew Wilcox
  2020-09-29 17:26   ` John David Anglin
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-09-29 17:01 UTC (permalink / raw)
  To: linux-parisc

On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
> I think we can end up truncating a PMD or PGD entry (I get confused
> easily about levels of the page tables; bear with me)
> 
> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
>  * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
> typedef struct { __u32 pgd; } pgd_t;
> ...
> typedef struct { __u32 pmd; } pmd_t;
> 
> ...
> 
>         pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
>                                                PGD_ALLOC_ORDER);
> ...
>         return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
> 
> so if we have more than 2GB of RAM, we can allocate a page with the top
> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
> and mask it off, causing us to load the wrong page for the next level
> of the page table walk.
> 
> Have I missed something?

Yes, yes I have.

We store the PFN, not the physical address.  So we have 28 bits for
storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
(1TB) of physical address space.


* Re: Page tables on machines with >2GB RAM
  2020-09-29 17:01 ` Matthew Wilcox
@ 2020-09-29 17:26   ` John David Anglin
  2020-09-29 18:14     ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: John David Anglin @ 2020-09-29 17:26 UTC (permalink / raw)
  To: Matthew Wilcox, linux-parisc

On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
> On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
>> I think we can end up truncating a PMD or PGD entry (I get confused
>> easily about levels of the page tables; bear with me)
>>
>> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
>>  * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
>> typedef struct { __u32 pgd; } pgd_t;
>> ...
>> typedef struct { __u32 pmd; } pmd_t;
>>
>> ...
>>
>>         pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
>>                                                PGD_ALLOC_ORDER);
>> ...
>>         return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
>>
>> so if we have more than 2GB of RAM, we can allocate a page with the top
>> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
>> and mask it off, causing us to load the wrong page for the next level
>> of the page table walk.
>>
>> Have I missed something?
> Yes, yes I have.
>
> We store the PFN, not the physical address.  So we have 28 bits for
> storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
> (1TB) of physical address space.
The comment in pgalloc.h says 8TB?  I think improving the description as to how this works
would be welcome.

-- 
John David Anglin  dave.anglin@bell.net



* Re: Page tables on machines with >2GB RAM
  2020-09-29 17:26   ` John David Anglin
@ 2020-09-29 18:14     ` Matthew Wilcox
  2020-10-04  5:29       ` Helge Deller
  0 siblings, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2020-09-29 18:14 UTC (permalink / raw)
  To: John David Anglin; +Cc: linux-parisc

On Tue, Sep 29, 2020 at 01:26:29PM -0400, John David Anglin wrote:
> On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
> > On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
> >> I think we can end up truncating a PMD or PGD entry (I get confused
> >> easily about levels of the page tables; bear with me)
> >>
> >> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
> >>  * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
> >> typedef struct { __u32 pgd; } pgd_t;
> >> ...
> >> typedef struct { __u32 pmd; } pmd_t;
> >>
> >> ...
> >>
> >>         pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
> >>                                                PGD_ALLOC_ORDER);
> >> ...
> >>         return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
> >>
> >> so if we have more than 2GB of RAM, we can allocate a page with the top
> >> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
> >> and mask it off, causing us to load the wrong page for the next level
> >> of the page table walk.
> >>
> >> Have I missed something?
> > Yes, yes I have.
> >
> > We store the PFN, not the physical address.  So we have 28 bits for
> > storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
> > (1TB) of physical address space.
> The comment in pgalloc.h says 8TB?  I think improving the description as to how this works
> would be welcome.

It's talking about 8TB of virtual address space.  But I think it's wrong.
On 64-bit,

Each PTE defines a 4kB region of address space (ie one page).
Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB
Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB
The top-level allocation is a 32kB allocation, but the first 8kB is used
for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.

I think the top level allocation was supposed to be an order-2 allocation,
which would be an 8TB address space, but it's order-3.

There's a lot of commentary which disagrees with the code.  For example,

#define PMD_ORDER       1 /* Number of pages per pmd */

That's just not true; an order-1 allocation is 2 pages, not 1.


* Re: Page tables on machines with >2GB RAM
  2020-09-29 18:14     ` Matthew Wilcox
@ 2020-10-04  5:29       ` Helge Deller
  2020-10-04 12:22         ` Matthew Wilcox
  0 siblings, 1 reply; 6+ messages in thread
From: Helge Deller @ 2020-10-04  5:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-parisc

On 9/29/20 8:14 PM, Matthew Wilcox wrote:
> On Tue, Sep 29, 2020 at 01:26:29PM -0400, John David Anglin wrote:
>> On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
>>> On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
>>>> I think we can end up truncating a PMD or PGD entry (I get confused
>>>> easily about levels of the page tables; bear with me)
>>>>
>>>> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
>>>>  * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
>>>> typedef struct { __u32 pgd; } pgd_t;
>>>> ...
>>>> typedef struct { __u32 pmd; } pmd_t;
>>>>
>>>> ...
>>>>
>>>>         pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
>>>>                                                PGD_ALLOC_ORDER);
>>>> ...
>>>>         return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
>>>>
>>>> so if we have more than 2GB of RAM, we can allocate a page with the top
>>>> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
>>>> and mask it off, causing us to load the wrong page for the next level
>>>> of the page table walk.
>>>>
>>>> Have I missed something?
>>> Yes, yes I have.
>>>
>>> We store the PFN, not the physical address.  So we have 28 bits for
>>> storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
>>> (1TB) of physical address space.
>> The comment in pgalloc.h says 8TB?  I think improving the description as to how this works
>> would be welcome.
>
> It's talking about 8TB of virtual address space.  But I think it's wrong.
> On 64-bit,
>
> Each PTE defines a 4kB region of address space (ie one page).
> Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB

No, a PMD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4kB = 4MB.
We always use 4-byte entries, for both 32- and 64-bit kernels.

> Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB

No, each PGD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4MB = 4GB.
Still, my calculation ends up with 4GB, like yours.

> The top-level allocation is a 32kB allocation, but the first 8kB is used
> for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.

The size of the PGD (swapper_pg_dir) is 8kB, so we have 8kB / 4 bytes * 4GB = 8TB
of virtual address space.

At boot we want to map (1 << KERNEL_INITIAL_ORDER) pages (= 64MB on a 64-bit kernel),
and for this pmd0 gets pre-allocated with 8kB size, and pg0 with 132kB, to
simplify filling the initial page tables - but that's not relevant for
the calculations above.

> I think the top level allocation was supposed to be an order-2 allocation,
> which would be an 8TB address space, but it's order-3.
>
> There's a lot of commentary which disagrees with the code.  For example,
>
> #define PMD_ORDER       1 /* Number of pages per pmd */
> That's just not true; an order-1 allocation is 2 pages, not 1.

Yes, that should be fixed up.

Helge


* Re: Page tables on machines with >2GB RAM
  2020-10-04  5:29       ` Helge Deller
@ 2020-10-04 12:22         ` Matthew Wilcox
  0 siblings, 0 replies; 6+ messages in thread
From: Matthew Wilcox @ 2020-10-04 12:22 UTC (permalink / raw)
  To: Helge Deller; +Cc: linux-parisc

On Sun, Oct 04, 2020 at 07:29:33AM +0200, Helge Deller wrote:
> On 9/29/20 8:14 PM, Matthew Wilcox wrote:
> > It's talking about 8TB of virtual address space.  But I think it's wrong.
> > On 64-bit,
> >
> > Each PTE defines a 4kB region of address space (ie one page).
> > Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB
> 
> No, a PMD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4kB = 4MB.
> We always use 4-byte entries, for both 32- and 64-bit kernels.

#if CONFIG_PGTABLE_LEVELS == 3
#define PGD_ORDER       1 /* Number of pages per pgd */
#define PMD_ORDER       1 /* Number of pages per pmd */
#define PGD_ALLOC_ORDER (2 + 1) /* first pgd contains pmd */
...
#if CONFIG_PGTABLE_LEVELS == 3
...
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
        return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
}

We're definitely doing an 8kB allocation.  If we should be allocating
4kB, then PMD_ORDER should be 0.

The 32-bit entries, even on 64-bit, are a nice hack.  I think that just means
we're over-allocating memory for the page tables.

> > Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB
> 
> No, each PGD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4MB = 4GB.
> Still, my calculation ends up with 4GB, like yours.

Again, I think there's an order vs count confusion here.

> > The top-level allocation is a 32kB allocation, but the first 8kB is used
> > for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.
> 
> The size of the PGD (swapper_pg_dir) is 8kB, so we have 8kB / 4 bytes * 4GB = 8TB
> of virtual address space.
> 
> At boot we want to map (1 << KERNEL_INITIAL_ORDER) pages (= 64MB on a 64-bit kernel),
> and for this pmd0 gets pre-allocated with 8kB size, and pg0 with 132kB, to
> simplify filling the initial page tables - but that's not relevant for
> the calculations above.

I was talking about pgd_alloc():

        pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
                                               PGD_ALLOC_ORDER);

where we allocate eight 4kB pages (32kB, order 3).

> > I think the top level allocation was supposed to be an order-2 allocation,
> > which would be an 8TB address space, but it's order-3.
> >
> > There's a lot of commentary which disagrees with the code.  For example,
> >
> > #define PMD_ORDER       1 /* Number of pages per pmd */
> > That's just not true; an order-1 allocation is 2 pages, not 1.
> 
> Yes, that should be fixed up.
> 
> Helge


end of thread (newest: 2020-10-04 12:22 UTC)

Thread overview: 6+ messages
2020-09-29 15:33 Page tables on machines with >2GB RAM Matthew Wilcox
2020-09-29 17:01 ` Matthew Wilcox
2020-09-29 17:26   ` John David Anglin
2020-09-29 18:14     ` Matthew Wilcox
2020-10-04  5:29       ` Helge Deller
2020-10-04 12:22         ` Matthew Wilcox
