The patchset fixes issues with LDT remap for PTI:

 - Layout collision due to KASLR with 5-level paging;

 - Information leak via Meltdown-like attack;

Please review and consider applying.

v2:
 - Rebase to the Linus' tree
   + fix conflict with new documentation of kernel memory layout
   + fix few mistakes in layout documentation
 - Fix typo in commit message

Kirill A. Shutemov (2):
  x86/mm: Move LDT remap out of KASLR region on 5-level paging
  x86/ldt: Unmap PTEs for the slot before freeing LDT pages

 Documentation/x86/x86_64/mm.txt         | 34 +++++++-------
 arch/x86/include/asm/page_64_types.h    | 12 ++---
 arch/x86/include/asm/pgtable_64_types.h |  4 +-
 arch/x86/kernel/ldt.c                   | 59 ++++++++++++++++---------
 arch/x86/xen/mmu_pv.c                   |  6 +--
 5 files changed, 67 insertions(+), 48 deletions(-)

-- 
2.19.1

On 5-level paging LDT remap area is placed in the middle of
KASLR randomization region and it can overlap with direct mapping,
vmalloc or vmap area.

Let's move LDT just before direct mapping which makes it safe for KASLR.
This also allows us to unify layout between 4- and 5-level paging.

We don't touch 4 pgd slot gap just before the direct mapping reserved
for a hypervisor, but move direct mapping by one slot instead.

The LDT mapping is per-mm, so we cannot move it into P4D page table next
to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
paging.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: f55f0501cbf6 ("x86/pti: Put the LDT in its own PGD if PTI is on")
---
 Documentation/x86/x86_64/mm.txt         | 34 +++++++++++++------------
 arch/x86/include/asm/page_64_types.h    | 12 +++++----
 arch/x86/include/asm/pgtable_64_types.h |  4 +--
 arch/x86/xen/mmu_pv.c                   |  6 ++---
 4 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 702898633b00..75bff98928a8 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
- ffff880000000000 | -120    TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
- ffffc80000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
+ ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
+ ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
  ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
  ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
  ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
  ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
  ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
- fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | LDT remap for PTI
- ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
-                                                            | Identical layout to the 47-bit one from here on:
+                                                            | Identical layout to the 56-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
@@ -83,7 +84,7 @@ Notes:
 __________________|____________|__________________|_________|___________________________________________________________
                   |            |                  |         |
  0000800000000000 |  +64    PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
-                  |            |                  |         |     virtual memory addresses up to the -128 TB
+                  |            |                  |         |     virtual memory addresses up to the -64 PB
                   |            |                  |         |     starting offset of kernel mappings.
 __________________|____________|__________________|_________|___________________________________________________________
                                                             |
@@ -91,23 +92,24 @@ __________________|____________|__________________|_________|___________________
 ____________________________________________________________|___________________________________________________________
                   |            |                  |         |
  ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
- ff10000000000000 |  -60    PB | ff8fffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
- ff90000000000000 |  -28    PB | ff9fffffffffffff |    4 PB | LDT remap for PTI
+ ff10000000000000 |  -60    PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
+ ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
+ ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
  ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
  ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
  ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
  ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
  ffdf000000000000 |   -8.25 PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
- fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
-                  |            |                  |         | vaddr_end for KASLR
- fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
- fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
- ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 __________________|____________|__________________|_________|____________________________________________________________
                                                             |
                                                             | Identical layout to the 47-bit one from here on:
 ____________________________________________________________|____________________________________________________________
                   |            |                  |         |
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
  ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
  ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
  ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index cd0cf1c568b4..8f657286d599 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -33,12 +33,14 @@
 
 /*
  * Set __PAGE_OFFSET to the most negative possible address +
- * PGDIR_SIZE*16 (pgd slot 272).  The gap is to allow a space for a
- * hypervisor to fit.  Choosing 16 slots here is arbitrary, but it's
- * what Xen requires.
+ * PGDIR_SIZE*17 (pgd slot 273).
+ *
+ * The gap is to allow a space for LDT remap for PTI (1 pgd slot) and space for
+ * a hypervisor (16 slots). Choosing 16 slots for a hypervisor is arbitrary,
+ * but it's what Xen requires.
  */
-#define __PAGE_OFFSET_BASE_L5	_AC(0xff10000000000000, UL)
-#define __PAGE_OFFSET_BASE_L4	_AC(0xffff880000000000, UL)
+#define __PAGE_OFFSET_BASE_L5	_AC(0xff11000000000000, UL)
+#define __PAGE_OFFSET_BASE_L4	_AC(0xffff888000000000, UL)
 
 #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
 #define __PAGE_OFFSET           page_offset_base
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index 04edd2d58211..84bd9bdc1987 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -111,9 +111,7 @@ extern unsigned int ptrs_per_p4d;
  */
 #define MAXMEM			(1UL << MAX_PHYSMEM_BITS)
 
-#define LDT_PGD_ENTRY_L4	-3UL
-#define LDT_PGD_ENTRY_L5	-112UL
-#define LDT_PGD_ENTRY		(pgtable_l5_enabled() ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4)
+#define LDT_PGD_ENTRY		-240UL
 #define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #define LDT_END_ADDR		(LDT_BASE_ADDR + PGDIR_SIZE)
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 70ea598a37d2..7a2a74c2dd30 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1905,7 +1905,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	init_top_pgt[0] = __pgd(0);
 
 	/* Pre-constructed entries are in pfn, so convert to mfn */
-	/* L4[272] -> level3_ident_pgt  */
+	/* L4[273] -> level3_ident_pgt  */
 	/* L4[511] -> level3_kernel_pgt */
 	convert_pfn_mfn(init_top_pgt);
 
@@ -1925,8 +1925,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[0] = (unsigned long)pgd;
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
-	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][510] have entries that point to the same
+	/* Graft it onto L4[273][0]. Note that we creating an aliasing problem:
+	 * Both L4[273][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
-- 
2.19.1

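For readers checking the new layout by hand, the addresses in the patch follow from shifting a negative pgd slot number by PGDIR_SHIFT. The following is only a user-space sketch, not kernel code: it assumes PGDIR_SHIFT is 39 with 4-level paging and 48 with 5-level paging, and the names pgd_slot_base(), PGDIR_SHIFT_L4, PGDIR_SHIFT_L5 and PAGE_OFFSET_PGD_ENTRY are made up for illustration.

```c
#include <stdint.h>

/* Assumed PGDIR_SHIFT values: 39 for 4-level paging, 48 for 5-level. */
#define PGDIR_SHIFT_L4 39
#define PGDIR_SHIFT_L5 48

/* Slot numbers counted down from the top of the address space. */
#define LDT_PGD_ENTRY         (-240L) /* from the patch: pgd slot 272 */
#define PAGE_OFFSET_PGD_ENTRY (-239L) /* hypothetical name: pgd slot 273 */

/*
 * Base virtual address covered by a (negative) pgd slot.
 * Converting the negative slot to uint64_t wraps modulo 2^64,
 * which mirrors what "-240UL << PGDIR_SHIFT" does in the kernel macro.
 */
static uint64_t pgd_slot_base(int64_t slot, unsigned int pgdir_shift)
{
	return (uint64_t)slot << pgdir_shift;
}
```

With these assumptions, slot -240 gives 0xffff880000000000 (4-level) and 0xff10000000000000 (5-level), and the next slot gives the new __PAGE_OFFSET_BASE_L4/L5 values, so the same LDT_PGD_ENTRY works for both paging modes.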
modify_ldt(2) leaves old LDT mapped after we switch over to the new one.
Memory for the old LDT gets freed and the pages can be re-used.

Leaving the mapping in place can have security implications. The mapping
is present in userspace copy of page tables and Meltdown-like attack can
read these freed and possibly reused pages.

It's relatively simple to fix: just unmap the old LDT and flush TLB
before freeing LDT memory.

We can now avoid flushing TLB on map_ldt_struct() as the slot is unmapped
and flushed by unmap_ldt_struct() (or never mapped in the first place).
The overhead of the change should be negligible. It shouldn't be
a particularly hot path anyway.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: f55f0501cbf6 ("x86/pti: Put the LDT in its own PGD if PTI is on")
---
 arch/x86/kernel/ldt.c | 59 ++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ab18e0884dc6..5dc8ed202fa8 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -199,14 +199,6 @@ static void sanity_check_ldt_mapping(struct mm_struct *mm)
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
- *
- * There is no corresponding unmap function.  Even if the LDT is freed, we
- * leave the PTEs around until the slot is reused or the mm is destroyed.
- * This is harmless: the LDT is always in ordinary memory, and no one will
- * access the freed slot.
- *
- * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
- * it useful, and the flush would slow down modify_ldt().
  */
 static int
 map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
@@ -214,8 +206,7 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	unsigned long va;
 	bool is_vmalloc;
 	spinlock_t *ptl;
-	pgd_t *pgd;
-	int i;
+	int i, nr_pages;
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return 0;
@@ -229,16 +220,10 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	/* Check if the current mappings are sane */
 	sanity_check_ldt_mapping(mm);
 
-	/*
-	 * Did we already have the top level entry allocated?  We can't
-	 * use pgd_none() for this because it doens't do anything on
-	 * 4-level page table kernels.
-	 */
-	pgd = pgd_offset(mm, LDT_BASE_ADDR);
-
 	is_vmalloc = is_vmalloc_addr(ldt->entries);
 
-	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
+	nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	for (i = 0; i < nr_pages; i++) {
 		unsigned long offset = i << PAGE_SHIFT;
 		const void *src = (char *)ldt->entries + offset;
 		unsigned long pfn;
@@ -272,13 +257,39 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 	/* Propagate LDT mapping to the user page-table */
 	map_ldt_struct_to_user(mm);
 
-	va = (unsigned long)ldt_slot_va(slot);
-	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, PAGE_SHIFT, false);
-
 	ldt->slot = slot;
 	return 0;
 }
 
+static void
+unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+	unsigned long va;
+	int i, nr_pages;
+
+	if (!ldt)
+		return;
+
+	/* LDT map/unmap is only required for PTI */
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+	for (i = 0; i < nr_pages; i++) {
+		unsigned long offset = i << PAGE_SHIFT;
+		pte_t *ptep;
+		spinlock_t *ptl;
+
+		va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
+		ptep = get_locked_pte(mm, va, &ptl);
+		pte_clear(mm, va, ptep);
+		pte_unmap_unlock(ptep, ptl);
+	}
+
+	va = (unsigned long)ldt_slot_va(ldt->slot);
+	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, 0, false);
+}
+
 #else /* !CONFIG_PAGE_TABLE_ISOLATION */
 
 static int
@@ -286,6 +297,11 @@ map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
 {
 	return 0;
 }
+
+static void
+unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+}
 #endif /* CONFIG_PAGE_TABLE_ISOLATION */
 
 static void free_ldt_pgtables(struct mm_struct *mm)
@@ -524,6 +540,7 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 	}
 
 	install_ldt(mm, new_ldt);
+	unmap_ldt_struct(mm, old_ldt);
 	free_ldt_struct(old_ldt);
 	error = 0;
 
-- 
2.19.1

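The patch replaces the open-coded loop bound in map_ldt_struct() with a precomputed nr_pages and reuses the same count in unmap_ldt_struct(). A small user-space sketch, assuming PAGE_SIZE is 4096 and LDT_ENTRY_SIZE is 8 as on x86, suggests the two forms visit the same number of pages. The helper names ldt_nr_pages() and ldt_nr_pages_old_loop() are hypothetical and exist only for this comparison.

```c
#include <stddef.h>

/* Assumed x86 values; the kernel gets these from its own headers. */
#define PAGE_SIZE      4096UL
#define LDT_ENTRY_SIZE 8UL

/* Same rounding as the kernel's DIV_ROUND_UP() helper. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Page count as computed after the patch. */
static unsigned long ldt_nr_pages(unsigned long nr_entries)
{
	return DIV_ROUND_UP(nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
}

/* Page count implied by the loop condition the patch removes. */
static unsigned long ldt_nr_pages_old_loop(unsigned long nr_entries)
{
	unsigned long i;

	for (i = 0; i * PAGE_SIZE < nr_entries * LDT_ENTRY_SIZE; i++)
		;
	return i;
}
```

Under these assumptions a full 8192-entry LDT spans 16 pages, so the flush in unmap_ldt_struct() covers at most 16 pages of the slot rather than the whole LDT_SLOT_STRIDE range that map_ldt_struct() used to flush.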
On Wed, Oct 24, 2018 at 03:51:11PM +0300, Kirill A. Shutemov wrote:
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -34,23 +34,24 @@ __________________|____________|__________________|_________|___________________
>  ____________________________________________________________|___________________________________________________________
>                    |            |                  |         |
>   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor

Oh good, it's been rewritten for people with 200-column screens.  It's
too painful to review now.  This is how it looks for me, Ingo:

> @@ -34,23 +34,24 @@ __________________|____________|__________________|_______
__|___________________
>  ____________________________________________________________|________________
___________________________________________
>                    |            |                  |         |
>   ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor

If it were being formatted in rst so we could get a nice html view out
of the conversion, I'd understand.  But I don't see what we get from
this hilariously verbose reformatting.

Hi Kirill,

Thanks for making this patchset. I have some small concerns, please see
the inline comments.

On 10/24/18 at 03:51pm, Kirill A. Shutemov wrote:
> On 5-level paging LDT remap area is placed in the middle of
> KASLR randomization region and it can overlap with direct mapping,
> vmalloc or vmap area.
>
> Let's move LDT just before direct mapping which makes it safe for KASLR.
> This also allows us to unify layout between 4- and 5-level paging.

In the crash utility and makedumpfile, which are used to analyze system
memory content, PAGE_OFFSET is hardcoded as below in the non-KASLR case:

#define PAGE_OFFSET_2_6_27 0xffff880000000000

It seems they will need to add another value this time, for both 4-level
and 5-level, since the 5-level code also exists in stable kernels. Surely
this doesn't matter much.

>
> We don't touch 4 pgd slot gap just before the direct mapping reserved
> for a hypervisor, but move direct mapping by one slot instead.
>
> The LDT mapping is per-mm, so we cannot move it into P4D page table next
> to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
> paging.

As discussed in the private thread, you at first also agreed to put it
in the p4d entry next to CPU_ENTRY_AREA, but in the end you changed your
mind. There must be some reason you found when you implemented it and
investigated further. Could you please say more about how it would
complicate PGD table allocation for 5-level paging? Or give a use case
where it would complicate things?

Very sorry I am stupid, I still don't get the point. I would really
appreciate it.

Thanks
Baoquan

On Thu, Oct 25, 2018 at 10:18:09AM +0800, Baoquan He wrote:
> > We don't touch 4 pgd slot gap just before the direct mapping reserved
> > for a hypervisor, but move direct mapping by one slot instead.
> >
> > The LDT mapping is per-mm, so we cannot move it into P4D page table next
> > to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
> > paging.
>
> As discussed in the private thread, you at first also agreed to put it
> in the p4d entry next to CPU_ENTRY_AREA, but in the end you changed your
> mind. There must be some reason you found when you implemented it and
> investigated further. Could you please say more about how it would
> complicate PGD table allocation for 5-level paging? Or give a use case
> where it would complicate things?

On a 5-level machine all memory starting from CPU_ENTRY_AREA (and part of
KASAN memory) is in the same P4D page table. All this memory is shared
across all processes, we just copy the PGD entry -- all processes point to
the same P4D page table. (I leave out PTI from the picture for simplicity.)

LDT is per-mm. If we placed it next to CPU_ENTRY_AREA we would need
to unshare the P4D page table, create a new one on each fork and copy
the P4D entries.

It's considerably more complex and would affect processes that never use
modify_ldt() at all.

Another option would be to move LDT remap *to* the KASLR region for both
paging modes and make the KASLR code aware of it: randomize it as we do for
page offset, vmalloc, vmap. It's probably better long term, but it's more
complex and I wanted to get a backportable fix.

--
Kirill A. Shutemov
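
The sharing argument above can be checked with a little address arithmetic. This is only a user-space sketch, assuming 5-level paging uses a PGDIR_SHIFT of 48 and 512 entries per PGD; pgd_index_l5() and the two macros are hypothetical names, and the addresses come from the mm.txt table in patch 1.

```c
#include <stdint.h>

/* Assumed 5-level paging parameters. */
#define PGDIR_SHIFT_L5 48
#define PTRS_PER_PGD   512

/* Index of the PGD entry (and so of the P4D table) that maps addr. */
static unsigned int pgd_index_l5(uint64_t addr)
{
	return (addr >> PGDIR_SHIFT_L5) & (PTRS_PER_PGD - 1);
}
```

Under these assumptions cpu_entry_area (0xfffffe0000000000) and the tail of the KASAN shadow (0xfffffbffffffffff) both map through PGD entry 511, the one P4D table every process shares, while the new LDT remap base (0xff10000000000000) sits alone in entry 272 and can therefore stay per-mm.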
On 10/25/18 at 10:24am, Kirill A. Shutemov wrote:
> On Thu, Oct 25, 2018 at 10:18:09AM +0800, Baoquan He wrote:
> > > We don't touch 4 pgd slot gap just before the direct mapping reserved
> > > for a hypervisor, but move direct mapping by one slot instead.
> > >
> > > The LDT mapping is per-mm, so we cannot move it into P4D page table next
> > > to CPU_ENTRY_AREA without complicating PGD table allocation for 5-level
> > > paging.
> >
> > As discussed in the private thread, you at first also agreed to put it
> > in the p4d entry next to CPU_ENTRY_AREA, but in the end you changed your
> > mind. There must be some reason you found when you implemented it and
> > investigated further. Could you please say more about how it would
> > complicate PGD table allocation for 5-level paging? Or give a use case
> > where it would complicate things?
>
> On a 5-level machine all memory starting from CPU_ENTRY_AREA (and part of
> KASAN memory) is in the same P4D page table. All this memory is shared
> across all processes, we just copy the PGD entry -- all processes point to
> the same P4D page table. (I leave out PTI from the picture for simplicity.)

Yes, got it. I hadn't noticed this, thanks a lot.