* Page tables on machines with >2GB RAM
From: Matthew Wilcox @ 2020-09-29 15:33 UTC (permalink / raw)
To: linux-parisc
I think we can end up truncating a PMD or PGD entry (I get confused
easily about levels of the page tables; bear with me)
/* NOTE: even on 64 bits, these entries are __u32 because we allocate
* the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
typedef struct { __u32 pgd; } pgd_t;
...
typedef struct { __u32 pmd; } pmd_t;
...
pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
PGD_ALLOC_ORDER);
...
return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
so if we have more than 2GB of RAM, we can allocate a page with the top
bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
and mask it off, causing us to load the wrong page for the next level
of the page table walk.
Have I missed something?
Oh and I think this bug was introduced in 2004 ...
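If I'm reading it right, the failure mode would look something like this (a minimal sketch with an illustrative top "present" bit; the names and bit positions here are made up for illustration, not the real parisc layout):

```c
#include <stdint.h>

/* Hypothetical: suppose the top bit of the 32-bit entry is treated as a
 * "present" flag by the TLB miss handler and masked off on the walk.
 * These names and bit positions are illustrative only. */
#define FAKE_PRESENT_BIT 0x80000000u

/* Store a next-level table's physical address in a 32-bit entry. */
uint32_t store_entry(uint64_t phys)
{
    return (uint32_t)phys | FAKE_PRESENT_BIT;
}

/* What the miss handler would recover after masking the flag off. */
uint64_t load_entry(uint32_t entry)
{
    return entry & ~FAKE_PRESENT_BIT;
}

/* A table allocated below 2GB round-trips fine:
 *   load_entry(store_entry(0x40001000)) == 0x40001000
 * but one allocated above 2GB has its own top bit eaten:
 *   load_entry(store_entry(0x90001000)) == 0x10001000 -- the wrong page */
```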
* Re: Page tables on machines with >2GB RAM
From: Matthew Wilcox @ 2020-09-29 17:01 UTC (permalink / raw)
To: linux-parisc
On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
> I think we can end up truncating a PMD or PGD entry (I get confused
> easily about levels of the page tables; bear with me)
>
> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
> * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
> typedef struct { __u32 pgd; } pgd_t;
> ...
> typedef struct { __u32 pmd; } pmd_t;
>
> ...
>
> pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
> PGD_ALLOC_ORDER);
> ...
> return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
>
> so if we have more than 2GB of RAM, we can allocate a page with the top
> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
> and mask it off, causing us to load the wrong page for the next level
> of the page table walk.
>
> Have I missed something?
Yes, yes I have.
We store the PFN, not the physical address. So we have 28 bits for
storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
(1TB) of physical address space.
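So with 4kB pages, the arithmetic works out like this (a sketch of the scheme described above - PFN in the high 28 bits, 4 low flag bits - not the literal parisc bit layout):

```c
#include <stdint.h>

#define PAGE_SHIFT 12     /* 4kB pages */
#define PXD_FLAG_BITS 4   /* low bits reserved for PxD flags, per the above */

/* Pack a PFN and flags into a 32-bit PxD entry: PFN in the top 28 bits. */
uint32_t mk_pxd(uint32_t pfn, uint32_t flags)
{
    return (pfn << PXD_FLAG_BITS) | flags;
}

/* Recover the physical address of the next-level table. */
uint64_t pxd_phys(uint32_t pxd)
{
    return ((uint64_t)(pxd >> PXD_FLAG_BITS)) << PAGE_SHIFT;
}

/* 28 bits of PFN + 12 bits of page offset = 40 bits = 1TB of physical
 * address space, so a table allocated above 2GB is still representable:
 *   pxd_phys(mk_pxd(0x90001, 0)) == 0x90001000 -- no truncation */
```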
* Re: Page tables on machines with >2GB RAM
From: John David Anglin @ 2020-09-29 17:26 UTC (permalink / raw)
To: Matthew Wilcox, linux-parisc
On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
> On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
>> I think we can end up truncating a PMD or PGD entry (I get confused
>> easily about levels of the page tables; bear with me)
>>
>> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
>> * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
>> typedef struct { __u32 pgd; } pgd_t;
>> ...
>> typedef struct { __u32 pmd; } pmd_t;
>>
>> ...
>>
>> pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
>> PGD_ALLOC_ORDER);
>> ...
>> return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
>>
>> so if we have more than 2GB of RAM, we can allocate a page with the top
>> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
>> and mask it off, causing us to load the wrong page for the next level
>> of the page table walk.
>>
>> Have I missed something?
> Yes, yes I have.
>
> We store the PFN, not the physical address. So we have 28 bits for
> storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
> (1TB) of physical address space.
The comment in pgalloc.h says 8TB? I think an improved description of how this
works would be welcome.
--
John David Anglin dave.anglin@bell.net
* Re: Page tables on machines with >2GB RAM
From: Matthew Wilcox @ 2020-09-29 18:14 UTC (permalink / raw)
To: John David Anglin; +Cc: linux-parisc
On Tue, Sep 29, 2020 at 01:26:29PM -0400, John David Anglin wrote:
> On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
> > On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
> >> I think we can end up truncating a PMD or PGD entry (I get confused
> >> easily about levels of the page tables; bear with me)
> >>
> >> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
> >> * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
> >> typedef struct { __u32 pgd; } pgd_t;
> >> ...
> >> typedef struct { __u32 pmd; } pmd_t;
> >>
> >> ...
> >>
> >> pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
> >> PGD_ALLOC_ORDER);
> >> ...
> >> return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
> >>
> >> so if we have more than 2GB of RAM, we can allocate a page with the top
> >> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
> >> and mask it off, causing us to load the wrong page for the next level
> >> of the page table walk.
> >>
> >> Have I missed something?
> > Yes, yes I have.
> >
> > We store the PFN, not the physical address. So we have 28 bits for
> > storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
> > (1TB) of physical address space.
> The comment in pgalloc.h says 8TB? I think an improved description of how this
> works would be welcome.
It's talking about 8TB of virtual address space. But I think it's wrong.
On 64-bit,
Each PTE defines a 4kB region of address space (ie one page).
Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB
Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB
The top-level allocation is a 32kB allocation, but the first 8kB is used
for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.
I think the top level allocation was supposed to be an order-2 allocation,
which would be an 8TB address space, but it's order-3.
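Spelled out as arithmetic (using the entry and allocation sizes I've assumed above; treat these as my working numbers, not gospel):

```c
#include <stdint.h>

#define KB 1024ULL
#define PAGE_SIZE_4K (4 * KB)

/* Coverage per level, per the sizes above: 4kB PMD allocations with
 * 8-byte entries, 8kB PGD allocations with 4-byte entries. */
uint64_t pmd_covers(void)
{
    return (4 * KB / 8) * PAGE_SIZE_4K;          /* 512 entries * 4kB = 2MB */
}

uint64_t pgd_covers(void)
{
    return (8 * KB / 4) * pmd_covers();          /* 2048 entries * 2MB = 4GB */
}

uint64_t top_level_covers(void)
{
    /* 32kB allocation, minus the 8kB consumed by the first PGD. */
    return ((32 * KB - 8 * KB) / 4) * pgd_covers();  /* 6144 * 4GB = 24TB */
}
```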
There's a lot of commentary which disagrees with the code. For example,
#define PMD_ORDER 1 /* Number of pages per pmd */
That's just not true; an order-1 allocation is 2 pages, not 1.
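The distinction, spelled out (generic buddy-allocator arithmetic, nothing parisc-specific):

```c
/* A buddy-allocator "order" is an exponent, not a page count:
 * an order-N allocation is 2^N contiguous pages. */
unsigned long pages_per_order(unsigned int order)
{
    return 1UL << order;
}

unsigned long bytes_per_order(unsigned int order)
{
    return 4096UL << order;   /* assuming 4kB pages */
}

/* So order 1 means 2 pages (8kB), and "one page per pmd" would be order 0. */
```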
* Re: Page tables on machines with >2GB RAM
From: Helge Deller @ 2020-10-04 5:29 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-parisc
On 9/29/20 8:14 PM, Matthew Wilcox wrote:
> On Tue, Sep 29, 2020 at 01:26:29PM -0400, John David Anglin wrote:
>> On 2020-09-29 1:01 p.m., Matthew Wilcox wrote:
>>> On Tue, Sep 29, 2020 at 04:33:16PM +0100, Matthew Wilcox wrote:
>>>> I think we can end up truncating a PMD or PGD entry (I get confused
>>>> easily about levels of the page tables; bear with me)
>>>>
>>>> /* NOTE: even on 64 bits, these entries are __u32 because we allocate
>>>> * the pmd and pgd in ZONE_DMA (i.e. under 4GB) */
>>>> typedef struct { __u32 pgd; } pgd_t;
>>>> ...
>>>> typedef struct { __u32 pmd; } pmd_t;
>>>>
>>>> ...
>>>>
>>>> pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
>>>> PGD_ALLOC_ORDER);
>>>> ...
>>>> return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
>>>>
>>>> so if we have more than 2GB of RAM, we can allocate a page with the top
>>>> bit set, which we interpret to mean PAGE_PRESENT in the TLB miss handler
>>>> and mask it off, causing us to load the wrong page for the next level
>>>> of the page table walk.
>>>>
>>>> Have I missed something?
>>> Yes, yes I have.
>>>
>>> We store the PFN, not the physical address. So we have 28 bits for
>>> storing the PFN and 4 bits for the PxD bits, supporting 28 + 12 = 40 bits
>>> (1TB) of physical address space.
>> The comment in pgalloc.h says 8TB? I think an improved description of how this
>> works would be welcome.
>
> It's talking about 8TB of virtual address space. But I think it's wrong.
> On 64-bit,
>
> Each PTE defines a 4kB region of address space (ie one page).
> Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB
No, a PMD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4kB = 4MB.
We always use 4-byte entries, for both 32- and 64-bit kernels.
> Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB
No, each PGD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4MB = 4GB.
Still, my calculation ends up with 4GB, like yours.
> The top-level allocation is a 32kB allocation, but the first 8kB is used
> for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.
The size of the PGD (swapper_pg_dir) is 8kB, so we have 8kB / 4 bytes * 4GB = 8TB of
virtual address space.
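Checked as arithmetic (taking the sizes above - 4kB PMD and PGD allocations with 4-byte entries, an 8kB swapper_pg_dir - at face value):

```c
#include <stdint.h>

#define KB 1024ULL
#define PAGE_SIZE_4K (4 * KB)

/* Per-level coverage with 4-byte entries everywhere, as described above. */
uint64_t pmd_span(void)
{
    return (4 * KB / 4) * PAGE_SIZE_4K;   /* 1024 entries * 4kB = 4MB */
}

uint64_t pgd_span(void)
{
    return (4 * KB / 4) * pmd_span();     /* 1024 entries * 4MB = 4GB */
}

uint64_t swapper_pg_dir_span(void)
{
    return (8 * KB / 4) * pgd_span();     /* 2048 entries * 4GB = 8TB */
}
```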
At boot we want to map (1 << KERNEL_INITIAL_ORDER) pages (= 64MB on a 64-bit kernel),
and for this pmd0 gets pre-allocated with 8kB size, and pg0 with 132kB, to
simplify filling the initial page tables - but that's not relevant for
the calculations above.
> I think the top level allocation was supposed to be an order-2 allocation,
> which would be an 8TB address space, but it's order-3.
>
> There's a lot of commentary which disagrees with the code. For example,
>
> #define PMD_ORDER 1 /* Number of pages per pmd */
> That's just not true; an order-1 allocation is 2 pages, not 1.
Yes, that should be fixed up.
Helge
* Re: Page tables on machines with >2GB RAM
From: Matthew Wilcox @ 2020-10-04 12:22 UTC (permalink / raw)
To: Helge Deller; +Cc: linux-parisc
On Sun, Oct 04, 2020 at 07:29:33AM +0200, Helge Deller wrote:
> On 9/29/20 8:14 PM, Matthew Wilcox wrote:
> > It's talking about 8TB of virtual address space. But I think it's wrong.
> > On 64-bit,
> >
> > Each PTE defines a 4kB region of address space (ie one page).
> > Each PMD is a 4kB allocation with 8-byte entries, so covers 512 * 4kB = 2MB
>
> No, a PMD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4kB = 4MB.
> We always use 4-byte entries, for both 32- and 64-bit kernels.
#if CONFIG_PGTABLE_LEVELS == 3
#define PGD_ORDER 1 /* Number of pages per pgd */
#define PMD_ORDER 1 /* Number of pages per pmd */
#define PGD_ALLOC_ORDER (2 + 1) /* first pgd contains pmd */
...
#if CONFIG_PGTABLE_LEVELS == 3
...
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
{
return (pmd_t *)__get_free_pages(GFP_PGTABLE_KERNEL, PMD_ORDER);
}
We're definitely doing an 8kB allocation. If we should be allocating
4kB, then PMD_ORDER should be 0.
The 32-bit entries, even on 64-bit, are a nice hack. I think that just means
we're over-allocating memory for the page tables.
> > Each PGD is an 8kB allocation with 4-byte entries, so covers 2048 * 2M = 4GB
>
> No, each PGD is a 4kB allocation with 4-byte entries, so it covers 1024 * 4MB = 4GB.
> Still, my calculation ends up with 4GB, like yours.
Again, I think there's an order vs count confusion here.
> > The top-level allocation is a 32kB allocation, but the first 8kB is used
> > for the first PGD, so it covers 24kB / 4 bytes * 4GB = 24TB.
>
> size of PGD (swapper_pg_dir) is 8k, so we have 8k / 4 bytes * 4GB = 8 TB
> virtual address space.
>
> At boot we want to map (1 << KERNEL_INITIAL_ORDER) pages (= 64MB on a 64-bit kernel),
> and for this pmd0 gets pre-allocated with 8kB size, and pg0 with 132kB, to
> simplify filling the initial page tables - but that's not relevant for
> the calculations above.
I was talking about pgd_alloc():
pgd_t *pgd = (pgd_t *)__get_free_pages(GFP_KERNEL,
PGD_ALLOC_ORDER);
where we allocate 8 * 4kB pages
> > I think the top level allocation was supposed to be an order-2 allocation,
> > which would be an 8TB address space, but it's order-3.
> >
> > There's a lot of commentary which disagrees with the code. For example,
> >
> > #define PMD_ORDER 1 /* Number of pages per pmd */
> > That's just not true; an order-1 allocation is 2 pages, not 1.
>
> Yes, that should be fixed up.
>
> Helge