linux-doc.vger.kernel.org archive mirror
* [PATCH] Documentation/mm: Initial page table documentation
@ 2023-06-05 22:10 Linus Walleij
  2023-06-05 22:52 ` Randy Dunlap
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Linus Walleij @ 2023-06-05 22:10 UTC (permalink / raw)
  To: Andrew Morton, Jonathan Corbet
  Cc: linux-mm, linux-doc, Linus Walleij, Mike Rapoport

This is based on an earlier blog post at people.kernel.org,
it describes the concepts about page tables that were hardest
for me to grasp when dealing with them for the first time,
such as the prevalent three-letter acronyms pfn, pgd, p4d,
pud, pmd and pte.

I don't know if this is what people want, but it's what I would
have wanted.

I discussed at one point with Mike Rapoport to bring this into
the kernel documentation, so here is a small proposal.

Cc: Mike Rapoport <rppt@kernel.org>
Link: https://people.kernel.org/linusw/arm32-page-tables
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
 Documentation/mm/page_tables.rst | 125 +++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
index 96939571d7bc..a2e1671a0f1d 100644
--- a/Documentation/mm/page_tables.rst
+++ b/Documentation/mm/page_tables.rst
@@ -3,3 +3,128 @@
 ===========
 Page Tables
 ===========
+
+Paged virtual memory was invented along with virtual memory as a concept in
+1962 on the Ferranti Atlas Computer which was the first computer with paged
+virtual memory. The feature migrated to newer computers and became a de facto
+feature of all Unix-like systems as time went by. In 1985 the feature was
+included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
+
+The first computers with virtual memory had one single page table, but the
+increased size of physical memories demanded that the page tables be split in
+two hierarchical levels. This happens because a single page table cannot cover
+the desired amount of memory with the desired granularity, such as a page size
+of 4KB.
+
+The physical address corresponding to the virtual address is commonly
+defined by the index point in the hierarchy, and this is called a **page frame
+number** or **pfn**. The first entry on the top level to the first entry in the
+second and so on down the hierarchy will point out the virtual address for the
+physical memory address 0, which will be *pfn 0* and the highest pfn will be
+the last page of physical memory the external address bus of the CPU can
+address.
+
+With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
+address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
+and so on until we reach pfn 0xfffff at 0xfffff000.
+
+As you can see, with 4KB pages the page base address uses bits 12-31 of the
+address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
+`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.
+
+Over time a deeper hierarchy has been developed in response to increasing memory
+sizes. When Linux was created, 4KB pages and a single page table called
+`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
+the fact that Torvalds's first computer had 4MB of physical memory. Entries in
+this single table were referred to as *PTE*:s - page table entries.
+
+Over time the page table hierarchy has developed into this::
+
+  +-----+
+  | PGD |
+  +-----+
+     ^
+     |   +-----+
+     +---| P4D |
+         +-----+
+            ^
+            |   +-----+
+            +---| PUD |
+                +-----+
+                   ^
+                   |   +-----+
+                   +---| PMD |
+                       +-----+
+                          ^
+                          |   +-----+
+                          +---| PTE |
+                              +-----+
+
+
+Symbols on the different levels of the page table hierarchy have the following
+meaning:
+
+- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
+  main page table handling the PGD for the kernel memory is still found in
+  `swapper_pg_dir`, but each userspace process in the system also has its own
+  memory context and thus its own *pgd*, found in `struct mm_struct` which
+  in turn is referenced in each `struct task_struct`. So tasks have memory
+  context in the form of a `struct mm_struct` and this in turn has a
+  `pgd_t *pgd` pointer to the corresponding page global directory.
+
+- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
+  handle 5-level page tables after the *pud* was introduced. Now it was clear
+  that we need to replace *pgd*, *pmd*, *pud* etc with a figure indicating the
+  directory level and that we cannot go on with ad hoc names any more. This
+  is only used on systems which actually have 5 levels of page tables.
+
+- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
+  the other levels to handle 4-level page tables. Like *p4d*, it is potentially
+  unused.
+
+- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**.
+
+- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
+  The name is a bit confusing because while in Linux 1.0 this did refer to a
+  single page table entry in the top level page table, it was retrofitted
+  to be "what the level above points to". So when two-level page tables were
+  introduced, the *pte* became a list of pointers, which is why
+  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.
+
+As already mentioned, each level in the page table hierarchy is a *list of
+pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the next level
+below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
+number of pointers on each level is architecture-defined. The most usual layout
+is the `PAGE_SIZE` of the system divided by the number of bytes in a virtual
+address on the system so each page table level is exactly one page worth of
+pointers, which is usually what computer architects choose::
+
+    PMD
+  +-----+           PTE
+  | ptr |-------> +-----+
+  | ptr |-        | ptr |-------> PAGE
+  | ptr | \       | ptr |
+  | ptr |  \        ...
+  | ... |   \
+  | ptr |    \         PTE
+  +-----+     +----> +-----+
+                     | ptr |-------> PAGE
+                     | ptr |
+                       ...
+
+
+Each pointer in the lowest level of the page table hierarchy, i.e. each
+`pteval_t`-entry of the `PTRS_PER_PTE` entries in a `pte_t *`, will map exactly
+one `PAGE_SIZE`:d page of physical memory to exactly one page of virtual memory.
+
+The pte page table entries (pointers) on the lowest level of the hierarchy
+typically contain the high bits of a virtual address in its high bits, and in
+the lower bits it contains architecture-dependent control bits pertaining to
+the page.
+
+If the architecture does not use all the page table levels, they can be *folded*
+which means skipped, and all operations performed on page tables will be
+compile-time augmented to just skip a level when accessing the next lower
+level. Page table handling code that wishes to be architecture-neutral, such as
+the virtual memory manager, will however need to be written so that it
+traverses all of the currently five levels.
-- 
2.40.1



* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-05 22:10 [PATCH] Documentation/mm: Initial page table documentation Linus Walleij
@ 2023-06-05 22:52 ` Randy Dunlap
  2023-06-06  3:57 ` Matthew Wilcox
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Randy Dunlap @ 2023-06-05 22:52 UTC (permalink / raw)
  To: Linus Walleij, Andrew Morton, Jonathan Corbet
  Cc: linux-mm, linux-doc, Mike Rapoport



On 6/5/23 15:10, Linus Walleij wrote:
> This is based on an earlier blog post at people.kernel.org,
> it describes the concepts about page tables that were hardest
> for me to grasp when dealing with them for the first time,
> such as the prevalent three-letter acronyms pfn, pgd, p4d,
> pud, pmd and pte.
> 
> I don't know if this is what people want, but it's what I would
> have wanted.
> 
> I discussed at one point with Mike Rapoport to bring this into
> the kernel documentation, so here is a small proposal.
> 
> Cc: Mike Rapoport <rppt@kernel.org>
> Link: https://people.kernel.org/linusw/arm32-page-tables
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
>  Documentation/mm/page_tables.rst | 125 +++++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 96939571d7bc..a2e1671a0f1d 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -3,3 +3,128 @@
>  ===========
>  Page Tables
>  ===========
> +
> +Paged virtual memory was invented along with virtual memory as a concept in
> +1962 on the Ferranti Atlas Computer which was the first computer with paged
> +virtual memory. The feature migrated to newer computers and became a de facto
> +feature of all Unix-like systems as time went by. In 1985 the feature was
> +included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
> +
> +The first computers with virtual memory had one single page table, but the
> +increased size of physical memories demanded that the page tables be split in
> +two hierarchical levels. This happens because a single page table cannot cover
> +the desired amount of memory with the desired granualarity, such as a page size
> +of 4KB.
> +
> +The physical address corresponding to the virtual address is commonly
> +defined by the index point in the hierarchy, and this is called a **page frame
> +number** or **pfn**. The first entry on the top level to the first entry in the
> +second and so on down the hierarchy will point out the virtual address for the
> +physical memory address 0, which will be *pfn 0* and the highest pfn will be
> +the last page of physical memory the external address bus of the CPU can
> +address.
> +
> +With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
> +address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
> +and so on until we reach pfn 0xfffff at 0xfffff000.
> +
> +As you can see, with 4KB pages the page base address uses bits 12-31 of the
> +address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
> +`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`
> +
> +Over time a deeper hierarchy has been developed in response to increasing memory
> +sizes. When Linux was created, 4KB pages and a single page table called
> +`swapper_pg_dir` with 1024 entries was used, covering 4MB which coincided with
> +the fact that Torvald's first computer had 4MB of physical memory. Entries in
> +this single table was referred to as *PTE*:s - page table entries.
> +
> +Over time the page table hierarchy has developed into this::
> +
> +  +-----+
> +  | PGD |
> +  +-----+
> +     ^
> +     |   +-----+
> +     +---| P4D |
> +         +-----+
> +            ^
> +            |   +-----+
> +            +---| PUD |
> +                +-----+
> +                   ^
> +                   |   +-----+
> +                   +---| PMD |
> +                       +-----+
> +                          ^
> +                          |   +-----+
> +                          +---| PTE |
> +                              +-----+
> +
> +
> +Symbols on the different levels of the page table hierarchy have the following
> +meaning:
> +
> +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
> +  main page table handling the PGD for the kernel memory is still found in
> +  `swapper_pg_dir`, but each userspace process in the system also has its own
> +  memory context and thus its own *pgd*, found in `struct mm_struct` which
> +  in turn is referenced to in each `struct task_struct`. So tasks have memory
> +  context in the form of a `struct mm_struct` and this in turn has a
> +  `struct pgt_t *pgd` pointer to the corresponding page global directory.
> +
> +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
> +  handle 5-level page tables after the *pud* was introduced. Now it was clear
> +  that we nee to replace *pgd*, *pmd*, *pud* etc with a figure indicating the

             need

> +  directory level and that we cannot go on with ad hoc names any more. This
> +  is only used on systems which actually have 5 levels of page tables.
> +
> +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
> +  the other levels to handle 4-level page tables. Like *p4d*, it is potentially
> +  unused.
> +
> +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**.
> +
> +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> +  The name is a bit confusing because while in Linux 1.0 this did refer to a
> +  single page table entry in the top level page table, it was retrofitted
> +  to be "what the level above points to". So when two-level page tables were
> +  introduced, the *pte* became a list of pointers, which is why
> +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.
> +
> +As already mentioned, each level in the page table hierarchy is a *list of
> +pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the next level
> +below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
> +number of pointers on each level is architecture-defined. The most usual layout
> +is the `PAGE_SIZE` of the system divided by the number of bytes in a virtual
> +address on the system so each page table level is exactly one page worth of
> +pointers, which is usually what computer architects choose::
> +
> +    PMD
> +  +-----+           PTE
> +  | ptr |-------> +-----+
> +  | ptr |-        | ptr |-------> PAGE
> +  | ptr | \       | ptr |
> +  | ptr |  \        ...
> +  | ... |   \
> +  | ptr |    \         PTE
> +  +-----+     +----> +-----+
> +                     | ptr |-------> PAGE
> +                     | ptr |
> +                       ...
> +
> +
> +Each pointer in the lowest level of the page table hierarchy, i.e. each
> +`pteval_t`-entry of the `PTRS_PER_PTE` entries in a `pte_t *`, will map exactly
> +one `PAGE_SIZE`:d page of physical memory to exactly one page of virtual memory.
> +
> +The pte page table entries (pointers) on the lowest level of the hierarchy
> +typically contain the high bits of a virtual address in its high bits, and in
> +the lower bits it contains architecture-dependent control bits pertaining to
> +the page.
> +
> +If the architecture does not use all the page table levels, they can be *folded*
> +which means skipped, and all operations performed on page tables will be
> +compile-time augmented to just skip a level when accessing the next lower
> +level. Page table handling code that wish to be architecture-neutral, such as

                                        wishes

> +the virtual memory manager, will however need to be written so that it
> +traverses all of the currently five levels.

Thanks for the documentation.
-- 
~Randy


* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-05 22:10 [PATCH] Documentation/mm: Initial page table documentation Linus Walleij
  2023-06-05 22:52 ` Randy Dunlap
@ 2023-06-06  3:57 ` Matthew Wilcox
  2023-06-08  8:13   ` Linus Walleij
  2023-06-06  5:35 ` Mike Rapoport
  2023-06-08  9:31 ` Kuan-Ying Lee (李冠穎)
  3 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2023-06-06  3:57 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Andrew Morton, Jonathan Corbet, linux-mm, linux-doc, Mike Rapoport

On Tue, Jun 06, 2023 at 12:10:35AM +0200, Linus Walleij wrote:
> +Paged virtual memory was invented along with virtual memory as a concept in
> +1962 on the Ferranti Atlas Computer which was the first computer with paged
> +virtual memory. The feature migrated to newer computers and became a de facto
> +feature of all Unix-like systems as time went by. In 1985 the feature was
> +included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
> +
> +The first computers with virtual memory had one single page table, but the
> +increased size of physical memories demanded that the page tables be split in
> +two hierarchical levels. This happens because a single page table cannot cover
> +the desired amount of memory with the desired granualarity, such as a page size
> +of 4KB.

I'm not sure this is the best way to introduce the concept of the page
tables.  I might go with something more like ...

Page tables are a way to map virtual addresses to physical addresses.
While hardware architectures have many different ways of handling this,
Linux uses hierarchical tables, currently defined to be five levels in
height.  Architecture code takes care of mapping these software page
tables to whatever hardware requires on a given platform.

> +The physical address corresponding to the virtual address is commonly
> +defined by the index point in the hierarchy, and this is called a **page frame
> +number** or **pfn**. The first entry on the top level to the first entry in the
> +second and so on down the hierarchy will point out the virtual address for the
> +physical memory address 0, which will be *pfn 0* and the highest pfn will be
> +the last page of physical memory the external address bus of the CPU can
> +address.

This reads backwards to me.  The index point in the hierarchy (what an
unusual turn of phrase!) is surely the virtual address, since the
hierarchy is indexed by virtual addresses.  If this paragraph is
supposed to define what a pfn is, how about simply:

The pfn of a page of memory is the physical address of the page divided
by PAGE_SIZE.
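
To make that definition concrete, here is a minimal user-space sketch,
assuming 4KB pages. The helper names toy_phys_to_pfn() and toy_pfn_to_phys()
are made up for illustration; the kernel has its own macros for this
(PHYS_PFN() and PFN_PHYS() in include/linux/pfn.h), which this does not try
to reproduce exactly:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4KB pages, as in the example in the patch. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* Hypothetical helper: pfn of the page containing physical address phys. */
static inline uint64_t toy_phys_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;	/* same result as phys / PAGE_SIZE */
}

/* Hypothetical helper: physical base address of page frame pfn. */
static inline uint64_t toy_pfn_to_phys(uint64_t pfn)
{
	/* Guard against shifting significant bits out of the 64-bit value. */
	assert(pfn <= (UINT64_MAX >> PAGE_SHIFT));
	return pfn << PAGE_SHIFT;
}
```

With these, toy_phys_to_pfn(0x00001000) is 1 and toy_pfn_to_phys(0xfffff)
is 0xfffff000, the last 4KB page of a 32-bit address space.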

> +With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
> +address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
> +and so on until we reach pfn 0xfffff at 0xfffff000.

Good example, keep that.

> +Over time the page table hierarchy has developed into this::
> +
> +  +-----+
> +  | PGD |
> +  +-----+
> +     ^
> +     |   +-----+
> +     +---| P4D |
> +         +-----+
> +            ^
> +            |   +-----+
> +            +---| PUD |
> +                +-----+
> +                   ^
> +                   |   +-----+
> +                   +---| PMD |
> +                       +-----+
> +                          ^
> +                          |   +-----+
> +                          +---| PTE |
> +                              +-----+

Your arrows are backwards.  The PTE doesn't point to the PMD; the PMD
points to PTEs.

> +
> +Symbols on the different levels of the page table hierarchy have the following
> +meaning:
> +
> +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
> +  main page table handling the PGD for the kernel memory is still found in
> +  `swapper_pg_dir`, but each userspace process in the system also has its own
> +  memory context and thus its own *pgd*, found in `struct mm_struct` which
> +  in turn is referenced to in each `struct task_struct`. So tasks have memory
> +  context in the form of a `struct mm_struct` and this in turn has a
> +  `struct pgt_t *pgd` pointer to the corresponding page global directory.
> +
> +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
> +  handle 5-level page tables after the *pud* was introduced. Now it was clear
> +  that we nee to replace *pgd*, *pmd*, *pud* etc with a figure indicating the
> +  directory level and that we cannot go on with ad hoc names any more. This
> +  is only used on systems which actually have 5 levels of page tables.
> +
> +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
> +  the other levels to handle 4-level page tables. Like *p4d*, it is potentially
> +  unused.

You have rather too many forward references in this description for my
taste.  Start with the PTE, then the PMD, then  PUD, P4D, PGD.

> +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**.
> +
> +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> +  The name is a bit confusing because while in Linux 1.0 this did refer to a
> +  single page table entry in the top level page table, it was retrofitted
> +  to be "what the level above points to". So when two-level page tables were
> +  introduced, the *pte* became a list of pointers, which is why
> +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.

I don't think this is right.  PTRS_PER_PTE is how many pointers are in
the PMD page table, so it's how many pointers you can walk if you have a
pte *.  Yes, it's complicated and confusing, but I don't think this
explanation clears up any of that confusion.

> +As already mentioned, each level in the page table hierarchy is a *list of

array, not list

> +pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the next level
> +below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
> +number of pointers on each level is architecture-defined. The most usual layout

I don't think it's helpful to say this.  It's really not that usual
(maybe half of our architectures behave that way?)


I think a document like this that talks about page tables really needs to
include a description of how some PMDs / PUDs / ... may not be pointers
to lower levels, but direct pointers to the actual memory (ie THPs /
hugetlb pages).
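
As a rough illustration of why that matters, the sketch below computes how
much address space a single entry covers at each level. It assumes 4KB pages
and 9 index bits per level (the x86-64 style layout); the names
toy_entry_span() and enum toy_level are hypothetical, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 4KB pages, 512 entries per table. */
#define TOY_PAGE_SHIFT	12
#define TOY_LEVEL_BITS	9

/* Levels counted from the bottom of the hierarchy. */
enum toy_level { TOY_PTE, TOY_PMD, TOY_PUD, TOY_P4D, TOY_PGD };

/* Bytes of address space covered by one entry at the given level. */
static inline uint64_t toy_entry_span(enum toy_level level)
{
	assert(level <= TOY_PGD);
	return 1ULL << (TOY_PAGE_SHIFT + (unsigned int)level * TOY_LEVEL_BITS);
}
```

Under these assumptions a PTE covers 4KB, a PMD entry 2MB and a PUD entry
1GB, so a PMD or PUD entry that points directly at memory instead of at a
lower table maps a 2MB or 1GB region in one go.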


Sorry to take a wrecking ball to this, I'm sure you worked hard on it.


* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-05 22:10 [PATCH] Documentation/mm: Initial page table documentation Linus Walleij
  2023-06-05 22:52 ` Randy Dunlap
  2023-06-06  3:57 ` Matthew Wilcox
@ 2023-06-06  5:35 ` Mike Rapoport
  2023-06-08  9:31 ` Kuan-Ying Lee (李冠穎)
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2023-06-06  5:35 UTC (permalink / raw)
  To: Linus Walleij; +Cc: Andrew Morton, Jonathan Corbet, linux-mm, linux-doc

Hi Linus,

On Tue, Jun 06, 2023 at 12:10:35AM +0200, Linus Walleij wrote:
> This is based on an earlier blog post at people.kernel.org,
> it describes the concepts about page tables that were hardest
> for me to grasp when dealing with them for the first time,
> such as the prevalent three-letter acronyms pfn, pgd, p4d,
> pud, pmd and pte.
> 
> I don't know if this is what people want, but it's what I would
> have wanted.
> 
> I discussed at one point with Mike Rapoport to bring this into
> the kernel documentation, so here is a small proposal.
 
Thanks for the documentation. And I love asciiart :)

> Cc: Mike Rapoport <rppt@kernel.org>
> Link: https://people.kernel.org/linusw/arm32-page-tables
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
>  Documentation/mm/page_tables.rst | 125 +++++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 96939571d7bc..a2e1671a0f1d 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -3,3 +3,128 @@
>  ===========
>  Page Tables
>  ===========
> +
> +Paged virtual memory was invented along with virtual memory as a concept in
> +1962 on the Ferranti Atlas Computer which was the first computer with paged
> +virtual memory. The feature migrated to newer computers and became a de facto
> +feature of all Unix-like systems as time went by. In 1985 the feature was
> +included in the Intel 80386, which was the CPU Linux 1.0 was developed on.
> +
> +The first computers with virtual memory had one single page table, but the
> +increased size of physical memories demanded that the page tables be split in
> +two hierarchical levels. This happens because a single page table cannot cover
> +the desired amount of memory with the desired granualarity, such as a page size
> +of 4KB.
> +
> +The physical address corresponding to the virtual address is commonly
> +defined by the index point in the hierarchy, and this is called a **page frame
> +number** or **pfn**. The first entry on the top level to the first entry in the

                                                        ^ points?

I'd add this sentence before "The first entry":
A virtual address is split into indexes, one for each table in the page
table hierarchy.
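
A minimal sketch of that split, assuming 4KB pages, five levels and 9 index
bits per level (so 512 entries per table). The names toy_index() and
toy_page_offset() are hypothetical and only serve the illustration; real
architectures define their own shifts and table sizes:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch: 4KB pages, five levels, 9 bits per level. */
#define TOY_PAGE_SHIFT	12
#define TOY_LEVEL_BITS	9
#define TOY_LEVELS	5
#define TOY_PTRS	(1u << TOY_LEVEL_BITS)	/* 512 entries per table */

/* Levels counted from the bottom: 0 = pte, 1 = pmd, 2 = pud, 3 = p4d, 4 = pgd. */
static inline unsigned int toy_index(uint64_t va, unsigned int level)
{
	assert(level < TOY_LEVELS);
	return (unsigned int)((va >> (TOY_PAGE_SHIFT + level * TOY_LEVEL_BITS)) &
			      (TOY_PTRS - 1));
}

/* Offset of the addressed byte inside its page. */
static inline unsigned int toy_page_offset(uint64_t va)
{
	return (unsigned int)(va & ((1u << TOY_PAGE_SHIFT) - 1));
}
```

A virtual address of 0 yields index 0 at every level, which is the "first
entry on the top level to the first entry in the second and so on" case the
paragraph describes.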

> +second and so on down the hierarchy will point out the virtual address for the

Maybe something like:

... so on down the hierarchy so that the virtual address that has all indexes
set to 0 will point to physical memory address 0 ...

> +physical memory address 0, which will be *pfn 0* and the highest pfn will be
> +the last page of physical memory the external address bus of the CPU can
> +address.
> +
> +With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
> +address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
> +and so on until we reach pfn 0xfffff at 0xfffff000.
> +
> +As you can see, with 4KB pages the page base address uses bits 12-31 of the
> +address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
> +`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`
> +
> +Over time a deeper hierarchy has been developed in response to increasing memory
> +sizes. When Linux was created, 4KB pages and a single page table called
> +`swapper_pg_dir` with 1024 entries was used, covering 4MB which coincided with
> +the fact that Torvald's first computer had 4MB of physical memory. Entries in
> +this single table was referred to as *PTE*:s - page table entries.
> +
> +Over time the page table hierarchy has developed into this::
> +
> +  +-----+
> +  | PGD |
> +  +-----+
> +     ^
> +     |   +-----+
> +     +---| P4D |
> +         +-----+
> +            ^
> +            |   +-----+
> +            +---| PUD |
> +                +-----+
> +                   ^
> +                   |   +-----+
> +                   +---| PMD |
> +                       +-----+
> +                          ^
> +                          |   +-----+
> +                          +---| PTE |
> +                              +-----+
> +
> +
> +Symbols on the different levels of the page table hierarchy have the following
> +meaning:
> +
> +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
> +  main page table handling the PGD for the kernel memory is still found in
> +  `swapper_pg_dir`, but each userspace process in the system also has its own
> +  memory context and thus its own *pgd*, found in `struct mm_struct` which
> +  in turn is referenced to in each `struct task_struct`. So tasks have memory
> +  context in the form of a `struct mm_struct` and this in turn has a
> +  `struct pgt_t *pgd` pointer to the corresponding page global directory.
> +
> +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
> +  handle 5-level page tables after the *pud* was introduced. Now it was clear
> +  that we nee to replace *pgd*, *pmd*, *pud* etc with a figure indicating the

            ^ need

> +  directory level and that we cannot go on with ad hoc names any more. This
> +  is only used on systems which actually have 5 levels of page tables.
> +
> +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
> +  the other levels to handle 4-level page tables. Like *p4d*, it is potentially
> +  unused.
> +
> +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**.
> +
> +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> +  The name is a bit confusing because while in Linux 1.0 this did refer to a
> +  single page table entry in the top level page table, it was retrofitted
> +  to be "what the level above points to". So when two-level page tables were
> +  introduced, the *pte* became a list of pointers, which is why
> +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.
> +
> +As already mentioned, each level in the page table hierarchy is a *list of
> +pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the next level
> +below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
> +number of pointers on each level is architecture-defined. The most usual layout
> +is the `PAGE_SIZE` of the system divided by the number of bytes in a virtual
> +address on the system so each page table level is exactly one page worth of
> +pointers, which is usually what computer architects choose::
> +
> +    PMD
> +  +-----+           PTE
> +  | ptr |-------> +-----+
> +  | ptr |-        | ptr |-------> PAGE
> +  | ptr | \       | ptr |
> +  | ptr |  \        ...
> +  | ... |   \
> +  | ptr |    \         PTE
> +  +-----+     +----> +-----+
> +                     | ptr |-------> PAGE
> +                     | ptr |
> +                       ...
> +
> +
> +Each pointer in the lowest level of the page table hierarchy, i.e. each
> +`pteval_t`-entry of the `PTRS_PER_PTE` entries in a `pte_t *`, will map exactly
> +one `PAGE_SIZE`:d page of physical memory to exactly one page of virtual memory.
> +
> +The pte page table entries (pointers) on the lowest level of the hierarchy
> +typically contain the high bits of a virtual address in its high bits, and in
> +the lower bits it contains architecture-dependent control bits pertaining to
> +the page.

... typically contain the PFN in their high bits and architecture-dependent
control bits in the lower bits.
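
A small sketch of such an entry, assuming 4KB pages and a made-up layout
where the low 12 bits hold control bits and everything above holds the PFN.
The type toy_pteval_t, the TOY_PTE_* bits and the helper names are all
hypothetical; real layouts are architecture-specific and differ from this:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4KB pages, control bits in the low 12 bits. */
#define TOY_PAGE_SHIFT		12
#define TOY_PTE_FLAGS_MASK	((1ULL << TOY_PAGE_SHIFT) - 1)

/* Hypothetical control bits, not taken from any real architecture. */
#define TOY_PTE_PRESENT		(1ULL << 0)
#define TOY_PTE_WRITE		(1ULL << 1)

typedef uint64_t toy_pteval_t;

/* Build an entry from a pfn and a set of control bits. */
static inline toy_pteval_t toy_mk_pte(uint64_t pfn, uint64_t flags)
{
	assert(pfn <= (UINT64_MAX >> TOY_PAGE_SHIFT));
	return (pfn << TOY_PAGE_SHIFT) | (flags & TOY_PTE_FLAGS_MASK);
}

/* Extract the pfn stored in the high bits. */
static inline uint64_t toy_pte_pfn(toy_pteval_t pte)
{
	return pte >> TOY_PAGE_SHIFT;
}

/* Extract the control bits stored in the low bits. */
static inline uint64_t toy_pte_flags(toy_pteval_t pte)
{
	return pte & TOY_PTE_FLAGS_MASK;
}
```

For example, toy_mk_pte(0x12345, TOY_PTE_PRESENT | TOY_PTE_WRITE) gives
0x12345003 in this toy layout.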

> +
> +If the architecture does not use all the page table levels, they can be *folded*
> +which means skipped, and all operations performed on page tables will be
> +compile-time augmented to just skip a level when accessing the next lower
> +level. Page table handling code that wish to be architecture-neutral, such as
> +the virtual memory manager, will however need to be written so that it
> +traverses all of the currently five levels.

I'd add something like

And even architecture-specific page table traversals are better off
using all the levels, for better robustness against future changes.
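
To illustrate the idea, here is a user-space toy model, not kernel code. It
assumes a machine with three real levels (pgd, pmd, pte), 4KB pages and 512
entries per table, with p4d and pud folded. All toy_* names are hypothetical;
they only mimic the shape of the kernel's pgd_offset()/p4d_offset()/
pud_offset()/pmd_offset() helpers, where a folded level just hands back the
entry from the level above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12
#define TOY_LEVEL_BITS	9
#define TOY_PTRS	(1u << TOY_LEVEL_BITS)

/* One table entry: a pointer to the next table, or a payload at pte level. */
typedef struct { uintptr_t val; } toy_ent_t;

static inline toy_ent_t *toy_pgd_offset(toy_ent_t *pgd_table, uint64_t va)
{
	return &pgd_table[(va >> 30) & (TOY_PTRS - 1)];
}

/* Folded levels: no table of their own, so return the upper entry as-is. */
static inline toy_ent_t *toy_p4d_offset(toy_ent_t *pgd, uint64_t va)
{
	(void)va;
	return pgd;
}

static inline toy_ent_t *toy_pud_offset(toy_ent_t *p4d, uint64_t va)
{
	(void)va;
	return p4d;
}

static inline toy_ent_t *toy_pmd_offset(toy_ent_t *pud, uint64_t va)
{
	return &((toy_ent_t *)pud->val)[(va >> 21) & (TOY_PTRS - 1)];
}

static inline toy_ent_t *toy_pte_offset(toy_ent_t *pmd, uint64_t va)
{
	return &((toy_ent_t *)pmd->val)[(va >> TOY_PAGE_SHIFT) & (TOY_PTRS - 1)];
}

/* Walk all five levels; returns the pte slot or NULL if a level is empty. */
static inline toy_ent_t *toy_walk(toy_ent_t *pgd_table, uint64_t va)
{
	assert(pgd_table != NULL);
	toy_ent_t *pgd = toy_pgd_offset(pgd_table, va);
	toy_ent_t *p4d = toy_p4d_offset(pgd, va);
	toy_ent_t *pud = toy_pud_offset(p4d, va);
	if (!pud->val)
		return NULL;
	toy_ent_t *pmd = toy_pmd_offset(pud, va);
	if (!pmd->val)
		return NULL;
	return toy_pte_offset(pmd, va);
}
```

toy_walk() is written against all five levels, so if this toy machine later
grew a real pud or p4d table, only the corresponding toy_*_offset() helper
would need to change.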

> -- 
> 2.40.1
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-06  3:57 ` Matthew Wilcox
@ 2023-06-08  8:13   ` Linus Walleij
  2023-06-08  9:00     ` Mike Rapoport
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Walleij @ 2023-06-08  8:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Jonathan Corbet, linux-mm, linux-doc, Mike Rapoport

Hi Matthew,

I fixed up most of the comments.

On Tue, Jun 6, 2023 at 5:57 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Jun 06, 2023 at 12:10:35AM +0200, Linus Walleij wrote:

> > +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> > +  The name is a bit confusing because while in Linux 1.0 this did refer to a
> > +  single page table entry in the top level page table, it was retrofitted
> > +  to be "what the level above points to". So when two-level page tables were
> > +  introduced, the *pte* became a list of pointers, which is why
> > +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.
>
> I don't think this is right.  PTRS_PER_PTE is how many pointers are in
> the PMD page table,

I don't get this. What does PTRS_PER_PMD mean then (and
then all the way up to PTRS_PER_PGD...)?

> so it's how many pointers you can walk if you have a
> pte *.  Yes, it's complicated and confusing, but I don't think this
> explanation clears up any of that confusion.

I will try to reword it so this gets through.

> > +pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the next level
> > +below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and so on. The
> > +number of pointers on each level is architecture-defined. The most usual layout
>
> I don't think it's helpful to say this.  It's really not that usual
> (maybe half of our architectures behave that way?)
>
> I think a document like this that talks about page tables really needs to
> include a description of how some PMDs / PUDs / ... may not be pointers
> to lower levels, but direct pointers to the actual memory (ie THPs /
> hugetlb pages).

I don't understand that stuff. I suggest you patch this into the document
when the basics are in place.

> Sorry to take a wrecking ball to this, I'm sure you worked hard on it.

Don't worry about that, I'm an academic, I just rewrite.

Yours,
Linus Walleij

* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-08  8:13   ` Linus Walleij
@ 2023-06-08  9:00     ` Mike Rapoport
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Rapoport @ 2023-06-08  9:00 UTC (permalink / raw)
  To: Linus Walleij
  Cc: Matthew Wilcox, Andrew Morton, Jonathan Corbet, linux-mm, linux-doc

On Thu, Jun 08, 2023 at 10:13:49AM +0200, Linus Walleij wrote:
> Hi Matthew,
> 
> I fixed up most of the comments.
> 
> On Tue, Jun 6, 2023 at 5:57 AM Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Jun 06, 2023 at 12:10:35AM +0200, Linus Walleij wrote:
> 
> > > +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
> > > +  The name is a bit confusing because while in Linux 1.0 this did refer to a
> > > +  single page table entry in the top level page table, it was retrofitted
> > > +  to be "what the level above points to". So when two-level page tables were
> > > +  introduced, the *pte* became a list of pointers, which is why
> > > +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly confusing.
> >
> > I don't think this is right.  PTRS_PER_PTE is how many pointers are in
> > the PMD page table,
> 
> I don't get this. What does PTRS_PER_PMD mean then (and
> then all the way up to PTRS_PER_PGD...)

PTRS_PER_PTE is the number of pointers in the lowest-level (pte) page table,
and pte_t is a "pointer" to an actual physical page mapped by the page tables.
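
To make that concrete, here is a minimal userspace sketch, not kernel code.
It assumes 4KB pages and 512 entries in the lowest-level table, which is
only one possible configuration. The SKETCH_* macros and the sketch_*()
helpers are made-up names for illustration; the index computation follows
the usual shift-and-mask idea.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only: 4KB pages, 512 entries per pte table. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PTRS_PER_PTE 512

/* Hypothetical helper: which slot of the pte table a virtual address selects. */
static inline unsigned int sketch_pte_index(uint64_t vaddr)
{
	return (unsigned int)((vaddr >> SKETCH_PAGE_SHIFT) &
			      (SKETCH_PTRS_PER_PTE - 1));
}

/* Hypothetical helper: bytes of virtual memory covered by one full pte table. */
static inline uint64_t sketch_pte_table_span(void)
{
	return (uint64_t)SKETCH_PTRS_PER_PTE << SKETCH_PAGE_SHIFT;
}
```

With these assumed numbers one pte table covers 512 * 4KB = 2MB, and each
of its slots maps exactly one page.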
 
> Yours,
> Linus Walleij

-- 
Sincerely yours,
Mike.

* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-05 22:10 [PATCH] Documentation/mm: Initial page table documentation Linus Walleij
                   ` (2 preceding siblings ...)
  2023-06-06  5:35 ` Mike Rapoport
@ 2023-06-08  9:31 ` Kuan-Ying Lee (李冠穎)
  2023-06-08 11:51   ` Linus Walleij
  3 siblings, 1 reply; 8+ messages in thread
From: Kuan-Ying Lee (李冠穎) @ 2023-06-08  9:31 UTC (permalink / raw)
  To: corbet, linus.walleij, akpm; +Cc: linux-mm, rppt, linux-doc

On Tue, 2023-06-06 at 00:10 +0200, Linus Walleij wrote:
> This is based on an earlier blog post at people.kernel.org,
> it describes the concepts about page tables that were hardest
> for me to grasp when dealing with them for the first time,
> such as the prevalent three-letter acronyms pfn, pgd, p4d,
> pud, pmd and pte.
> 
> I don't know if this is what people want, but it's what I would
> have wanted.
> 
> I discussed at one point with Mike Rapoport to bring this into
> the kernel documentation, so here is a small proposal.
> 
> Cc: Mike Rapoport <rppt@kernel.org>
> Link: https://people.kernel.org/linusw/arm32-page-tables
> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
> ---
>  Documentation/mm/page_tables.rst | 125
> +++++++++++++++++++++++++++++++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/Documentation/mm/page_tables.rst
> b/Documentation/mm/page_tables.rst
> index 96939571d7bc..a2e1671a0f1d 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -3,3 +3,128 @@
>  ===========
>  Page Tables
>  ===========
> +
> +Paged virtual memory was invented along with virtual memory as a
> concept in
> +1962 on the Ferranti Atlas Computer which was the first computer
> with paged
> +virtual memory. The feature migrated to newer computers and became a
> de facto
> +feature of all Unix-like systems as time went by. In 1985 the
> feature was
> +included in the Intel 80386, which was the CPU Linux 1.0 was
> developed on.
> +
> +The first computers with virtual memory had one single page table,
> but the
> +increased size of physical memories demanded that the page tables be
> split in
> +two hierarchical levels. This happens because a single page table
> cannot cover
> +the desired amount of memory with the desired granualarity, such as
> a page size
> +of 4KB.
> +
> +The physical address corresponding to the virtual address is
> commonly
> +defined by the index point in the hierarchy, and this is called a
> **page frame
> +number** or **pfn**. The first entry on the top level to the first
> entry in the
> +second and so on down the hierarchy will point out the virtual
> address for the
> +physical memory address 0, which will be *pfn 0* and the highest pfn
> will be
> +the last page of physical memory the external address bus of the CPU
> can
> +address.
> +
> +With a page granularity of 4KB and a address range of 32 bits, pfn 0
> is at
> +address 0x00000000, pfn 1 is at address 0x00004000, pfn 2 is at
> 0x00008000
> +and so on until we reach pfn 0x3ffff at 0xffffc000.

pfn 1 is at 0x00001000.
pfn 2 is at 0x00002000.

And so on until we reach pfn 0xfffff at 0xfffff000.

> +
> +As you can see, with 4KB pages the page base address uses bits 12-31 
> of the
> +address, and this is why `PAGE_SHIFT` in this case is defined as 12
> and
> +`PAGE_SIZE` is usually defined in terms of the page shift as `(1 <<
> PAGE_SHIFT)`
> +
> +Over time a deeper hierarchy has been developed in response to
> increasing memory
> +sizes. When Linux was created, 4KB pages and a single page table
> called
> +`swapper_pg_dir` with 1024 entries was used, covering 4MB which
> coincided with
> +the fact that Torvald's first computer had 4MB of physical memory.
> Entries in
> +this single table was referred to as *PTE*:s - page table entries.
> +
> +Over time the page table hierarchy has developed into this::
> +
> +  +-----+
> +  | PGD |
> +  +-----+
> +     ^
> +     |   +-----+
> +     +---| P4D |
> +         +-----+
> +            ^
> +            |   +-----+
> +            +---| PUD |
> +                +-----+
> +                   ^
> +                   |   +-----+
> +                   +---| PMD |
> +                       +-----+
> +                          ^
> +                          |   +-----+
> +                          +---| PTE |
> +                              +-----+
> +
> +
> +Symbols on the different levels of the page table hierarchy have the
> following
> +meaning:
> +
> +- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the
> Linux kernel
> +  main page table handling the PGD for the kernel memory is still
> found in
> +  `swapper_pg_dir`, but each userspace process in the system also
> has its own
> +  memory context and thus its own *pgd*, found in `struct mm_struct`
> which
> +  in turn is referenced to in each `struct task_struct`. So tasks
> have memory
> +  context in the form of a `struct mm_struct` and this in turn has a
> +  `struct pgt_t *pgd` pointer to the corresponding page global
> directory.
> +
> +- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was
> introduced to
> +  handle 5-level page tables after the *pud* was introduced. Now it
> was clear
> +  that we nee to replace *pgd*, *pmd*, *pud* etc with a figure
> indicating the
> +  directory level and that we cannot go on with ad hoc names any
> more. This
> +  is only used on systems which actually have 5 levels of page
> tables.
> +
> +- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was
> introduced after
> +  the other levels to handle 4-level page tables. Like *p4d*, it is
> potentially
> +  unused.
> +
> +- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**.
> +
> +- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned
> earlier.
> +  The name is a bit confusing because while in Linux 1.0 this did
> refer to a
> +  single page table entry in the top level page table, it was
> retrofitted
> +  to be "what the level above points to". So when two-level page
> tables were
> +  introduced, the *pte* became a list of pointers, which is why
> +  `PTRS_PER_PTE` exists. This oxymoronic term can be mildly
> confusing.
> +
> +As already mentioned, each level in the page table hierarchy is a
> *list of
> +pointers*, so the **pgd** contains `PTRS_PER_PGD` pointers to the
> next level
> +below, **p4d** contains `PTRS_PER_P4D` pointers to **pud** items and
> so on. The
> +number of pointers on each level is architecture-defined. The most
> usual layout
> +is the `PAGE_SIZE` of the system divided by the number of bytes in a
> virtual
> +address on the system so each page table level is exactly one page
> worth of
> +pointers, which is usually what computer architects choose::
> +
> +    PMD
> +  +-----+           PTE
> +  | ptr |-------> +-----+
> +  | ptr |-        | ptr |-------> PAGE
> +  | ptr | \       | ptr |
> +  | ptr |  \        ...
> +  | ... |   \
> +  | ptr |    \         PTE
> +  +-----+     +----> +-----+
> +                     | ptr |-------> PAGE
> +                     | ptr |
> +                       ...
> +
> +
> +Each pointer in the lowest level of the page table hierarchy, i.e.
> each
> +`pteval_t`-entry of the `PTRS_PER_PTE` entries in a `pte_t *`, will
> map exactly
> +one `PAGE_SIZE`:d page of physical memory to exactly one page of
> virtual memory.
> +
> +The pte page table entries (pointers) on the lowest level of the
> hierarchy
> +typically contain the high bits of a virtual address in its high
> bits, and in
> +the lower bits it contains architecture-dependent control bits
> pertaining to
> +the page.
> +
> +If the architecture does not use all the page table levels, they can
> be *folded*
> +which means skipped, and all operations performed on page tables
> will be
> +compile-time augmented to just skip a level when accessing the next
> lower
> +level. Page table handling code that wish to be architecture-
> neutral, such as
> +the virtual memory manager, will however need to be written so that
> it
> +traverses all of the currently five levels.

* Re: [PATCH] Documentation/mm: Initial page table documentation
  2023-06-08  9:31 ` Kuan-Ying Lee (李冠穎)
@ 2023-06-08 11:51   ` Linus Walleij
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Walleij @ 2023-06-08 11:51 UTC (permalink / raw)
  To: Kuan-Ying Lee (李冠穎)
  Cc: corbet, akpm, linux-mm, rppt, linux-doc

On Thu, Jun 8, 2023 at 11:32 AM Kuan-Ying Lee (李冠穎)
<Kuan-Ying.Lee@mediatek.com> wrote:

> > +With a page granularity of 4KB and a address range of 32 bits, pfn 0
> > is at
> > +address 0x00000000, pfn 1 is at address 0x00004000, pfn 2 is at
> > 0x00008000
> > +and so on until we reach pfn 0x3ffff at 0xffffc000.
>
> pfn 1 is at 0x00001000.
> pfn 2 is at 0x00002000.
>
> And so on until we reach pfn 0xfffff at 0xfffff000.

It seems I went immediately for 16K pages... Thanks, I'll fix it up.
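
For the record, the arithmetic can be checked with a small userspace sketch
like the one below. It is only an illustration under the assumption of a
32-bit address space; sketch_pfn_to_addr() and sketch_last_pfn() are
hypothetical helper names, not kernel functions. A page shift of 12
corresponds to 4KB pages and a shift of 14 to 16KB pages.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: base address of page frame 'pfn' for a given page shift. */
static inline uint32_t sketch_pfn_to_addr(uint32_t pfn, unsigned int page_shift)
{
	return pfn << page_shift;
}

/* Hypothetical helper: highest pfn in a 32-bit address space. */
static inline uint32_t sketch_last_pfn(unsigned int page_shift)
{
	return UINT32_MAX >> page_shift;
}
```

With a shift of 12 this gives pfn 1 at 0x00001000 and a last pfn of 0xfffff
at 0xfffff000; with a shift of 14 it gives the 0x00004000 and 0x3ffff
figures from the original patch text.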

Yours,
Linus Walleij

end of thread, other threads:[~2023-06-08 11:51 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-05 22:10 [PATCH] Documentation/mm: Initial page table documentation Linus Walleij
2023-06-05 22:52 ` Randy Dunlap
2023-06-06  3:57 ` Matthew Wilcox
2023-06-08  8:13   ` Linus Walleij
2023-06-08  9:00     ` Mike Rapoport
2023-06-06  5:35 ` Mike Rapoport
2023-06-08  9:31 ` Kuan-Ying Lee (李冠穎)
2023-06-08 11:51   ` Linus Walleij
