* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
2016-06-27 14:24 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
@ 2016-06-27 20:08 ` Borislav Petkov
2016-06-27 23:33 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
2016-06-30 13:17 ` [PATCH v4] " Rafael J. Wysocki
2 siblings, 0 replies; 39+ messages in thread
From: Borislav Petkov @ 2016-06-27 20:08 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Kees Cook, Logan Gunthorpe, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up to avoid page table corruption while the last bits of the
> image kernel's memory contents are copied into their original page
> frames re-use the boot kernel's text mapping, but that mapping may
> very well get corrupted just like any other part of the page tables.
> Of course, if that happens, the final jump to the image kernel's
> entry point will go nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch. Hence, the kernel text mapping is only needed before
> that code starts to run, and after that it is only used for the
> final jump to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too. That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables. Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it). Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> arch/x86/power/hibernate_64.c | 90 ++++++++++++++++++++++++++++++++------
> arch/x86/power/hibernate_asm_64.S | 55 ++++++++++-------------
> 2 files changed, 102 insertions(+), 43 deletions(-)
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-27 14:24 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
2016-06-27 20:08 ` Borislav Petkov
@ 2016-06-27 23:33 ` Logan Gunthorpe
2016-06-29 14:48 ` Kees Cook
2016-06-30 13:17 ` [PATCH v4] " Rafael J. Wysocki
2 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-27 23:33 UTC (permalink / raw)
To: Rafael J. Wysocki, Kees Cook, Borislav Petkov
Cc: Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar,
Peter Zijlstra, lkml, Rafael J. Wysocki, Andy Lutomirski,
Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
Stephen Smalley
Hey,
This version is working for me as well. Thanks.
Logan
On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>> Hey Rafael,
>>>
>>> This patch appears to be working on my laptop. Thanks.
>>
>> Same for me: resume still works with KASLR in my tests too.
>
> Unfortunately, Boris still sees post-resume memory corruption with the patch
> you have tested, but that turns out to be a result of some super-weird
> corruption of a pointer on the stack which leads to a store at a wrong address
> (and there's no way it can be related to the rest of the patch).
>
> We have verified that it can be avoided by rearranging set_up_temporary_text_mapping()
> to use fewer local variables and the appended patch contains this modification.
>
> I also went on and changed relocate_restore_code() slightly in a similar fashion,
> but all of those changes should not affect the behavior (unless there's something
> insane going on behind the curtains, that is).
>
> Kees, Logan, Boris, please try this one and let me know if it works for you.
>
> Thanks,
> Rafael
>
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up to avoid page table corruption while the last bits of the
> image kernel's memory contents are copied into their original page
> frames re-use the boot kernel's text mapping, but that mapping may
> very well get corrupted just like any other part of the page tables.
> Of course, if that happens, the final jump to the image kernel's
> entry point will go nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch. Hence, the kernel text mapping is only needed before
> that code starts to run, and after that it is only used for the
> final jump to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too. That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables. Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it). Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> arch/x86/power/hibernate_64.c | 90 ++++++++++++++++++++++++++++++++------
> arch/x86/power/hibernate_asm_64.S | 55 ++++++++++-------------
> 2 files changed, 102 insertions(+), 43 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
> * kernel's text (this value is passed in the image header).
> */
> unsigned long restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
> /*
> * Value of the cr3 register from before the hibernation (this value is passed
> @@ -37,7 +38,43 @@ unsigned long restore_cr3 __visible;
>
> pgd_t *temp_level4_pgt __visible;
>
> -void *relocated_restore_code __visible;
> +unsigned long relocated_restore_code __visible;
> +
> +static int set_up_temporary_text_mapping(void)
> +{
> + pmd_t *pmd;
> + pud_t *pud;
> +
> + /*
> + * The new mapping only has to cover the page containing the image
> + * kernel's entry point (jump_address_phys), because the switch over to
> + * it is carried out by relocated code running from a page allocated
> + * specifically for this purpose and covered by the identity mapping, so
> + * the temporary kernel text mapping is only needed for the final jump.
> + * Moreover, in that mapping the virtual address of the image kernel's
> + * entry point must be the same as its virtual address in the image
> + * kernel (restore_jump_address), so the image kernel's
> + * restore_registers() code doesn't find itself in a different area of
> + * the virtual address space after switching over to the original page
> + * tables used by the image kernel.
> + */
> + pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> + if (!pud)
> + return -ENOMEM;
> +
> + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> + if (!pmd)
> + return -ENOMEM;
> +
> + set_pmd(pmd + pmd_index(restore_jump_address),
> + __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
> + set_pud(pud + pud_index(restore_jump_address),
> + __pud(__pa(pmd) | _KERNPG_TABLE));
> + set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
> + __pgd(__pa(pud) | _KERNPG_TABLE));
> +
> + return 0;
> +}
>
> static void *alloc_pgt_page(void *context)
> {
> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
> if (!temp_level4_pgt)
> return -ENOMEM;
>
> - /* It is safe to reuse the original kernel mapping */
> - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> - init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> + /* Prepare a temporary mapping for the kernel text */
> + result = set_up_temporary_text_mapping();
> + if (result)
> + return result;
>
> /* Set up the direct mapping from scratch */
> for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +116,44 @@ static int set_up_temporary_mappings(voi
> return 0;
> }
>
> +static int relocate_restore_code(void)
> +{
> + pgd_t *pgd;
> + pmd_t *pmd;
> +
> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
> + if (!relocated_restore_code)
> + return -ENOMEM;
> +
> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> + /* Make the page containing the relocated code executable */
> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> + pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
> + relocated_restore_code);
> + if (pmd_large(*pmd)) {
> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> + } else {
> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> + }
> +
> + return 0;
> +}
> +
> int swsusp_arch_resume(void)
> {
> int error;
>
> /* We have got enough memory and from now on we cannot recover */
> - if ((error = set_up_temporary_mappings()))
> + error = set_up_temporary_mappings();
> + if (error)
> return error;
>
> - relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> - if (!relocated_restore_code)
> - return -ENOMEM;
> - memcpy(relocated_restore_code, &core_restore_code,
> - &restore_registers - &core_restore_code);
> + error = relocate_restore_code();
> + if (error)
> + return error;
>
> restore_image();
> return 0;
> @@ -109,11 +172,12 @@ int pfn_is_nosave(unsigned long pfn)
>
> struct restore_data_record {
> unsigned long jump_address;
> + unsigned long jump_address_phys;
> unsigned long cr3;
> unsigned long magic;
> };
>
> -#define RESTORE_MAGIC 0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC 0x123456789ABCDEF0UL
>
> /**
> * arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +190,8 @@ int arch_hibernation_header_save(void *a
>
> if (max_size < sizeof(struct restore_data_record))
> return -EOVERFLOW;
> - rdr->jump_address = restore_jump_address;
> + rdr->jump_address = (unsigned long)&restore_registers;
> + rdr->jump_address_phys = __pa_symbol(&restore_registers);
> rdr->cr3 = restore_cr3;
> rdr->magic = RESTORE_MAGIC;
> return 0;
> @@ -142,6 +207,7 @@ int arch_hibernation_header_restore(void
> struct restore_data_record *rdr = addr;
>
> restore_jump_address = rdr->jump_address;
> + jump_address_phys = rdr->jump_address_phys;
> restore_cr3 = rdr->cr3;
> return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
> }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
> pushfq
> popq pt_regs_flags(%rax)
>
> - /* save the address of restore_registers */
> - movq $restore_registers, %rax
> - movq %rax, restore_jump_address(%rip)
> /* save cr3 */
> movq %cr3, %rax
> movq %rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
> ENDPROC(swsusp_arch_suspend)
>
> ENTRY(restore_image)
> - /* switch to temporary page tables */
> - movq $__PAGE_OFFSET, %rdx
> - movq temp_level4_pgt(%rip), %rax
> - subq %rdx, %rax
> - movq %rax, %cr3
> - /* Flush TLB */
> - movq mmu_cr4_features(%rip), %rax
> - movq %rax, %rdx
> - andq $~(X86_CR4_PGE), %rdx
> - movq %rdx, %cr4; # turn off PGE
> - movq %cr3, %rcx; # flush TLB
> - movq %rcx, %cr3;
> - movq %rax, %cr4; # turn PGE back on
> -
> /* prepare to jump to the image kernel */
> - movq restore_jump_address(%rip), %rax
> - movq restore_cr3(%rip), %rbx
> + movq restore_jump_address(%rip), %r8
> + movq restore_cr3(%rip), %r9
> +
> + /* prepare to switch to temporary page tables */
> + movq temp_level4_pgt(%rip), %rax
> + movq mmu_cr4_features(%rip), %rbx
>
> /* prepare to copy image data to their original locations */
> movq restore_pblist(%rip), %rdx
> +
> + /* jump to relocated restore code */
> movq relocated_restore_code(%rip), %rcx
> jmpq *%rcx
>
> /* code below has been relocated to a safe page */
> ENTRY(core_restore_code)
> + /* switch to temporary page tables */
> + movq $__PAGE_OFFSET, %rcx
> + subq %rcx, %rax
> + movq %rax, %cr3
> + /* flush TLB */
> + movq %rbx, %rcx
> + andq $~(X86_CR4_PGE), %rcx
> + movq %rcx, %cr4; # turn off PGE
> + movq %cr3, %rcx; # flush TLB
> + movq %rcx, %cr3;
> + movq %rbx, %cr4; # turn PGE back on
> .Lloop:
> testq %rdx, %rdx
> jz .Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
> /* progress to the next pbe */
> movq pbe_next(%rdx), %rdx
> jmp .Lloop
> +
> .Ldone:
> /* jump to the restore_registers address from the image header */
> - jmpq *%rax
> - /*
> - * NOTE: This assumes that the boot kernel's text mapping covers the
> - * image kernel's page containing restore_registers and the address of
> - * this page is the same as in the image kernel's text mapping (it
> - * should always be true, because the text mapping is linear, starting
> - * from 0, and is supposed to cover the entire kernel text for every
> - * kernel).
> - *
> - * code below belongs to the image kernel
> - */
> + jmpq *%r8
>
> + /* code below belongs to the image kernel */
> + .align PAGE_SIZE
> ENTRY(restore_registers)
> FRAME_BEGIN
> /* go back to the original page tables */
> - movq %rbx, %cr3
> + movq %r9, %cr3
>
> /* Flush TLB, including "global" things (vmalloc) */
> movq mmu_cr4_features(%rip), %rax
>
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-27 23:33 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
@ 2016-06-29 14:48 ` Kees Cook
2016-06-30 1:52 ` Logan Gunthorpe
0 siblings, 1 reply; 39+ messages in thread
From: Kees Cook @ 2016-06-29 14:48 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Rafael J. Wysocki, Borislav Petkov, Linus Torvalds,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey,
>
> This version is working for me as well. Thanks.
>
> Logan
>
> On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
>>
>> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>>>
>>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com>
>>> wrote:
>>>>
>>>> Hey Rafael,
>>>>
>>>> This patch appears to be working on my laptop. Thanks.
>>>
>>>
>>> Same for me: resume still works with KASLR in my tests too.
>>
>>
>> Unfortunately, Boris still sees post-resume memory corruption with the
>> patch
>> you have tested, but that turns out to be a result of some super-weird
>> corruption of a pointer on the stack which leads to a store at a wrong
>> address
>> (and there's no way it can be related to the rest of the patch).
>>
>> We have verified that it can be avoided by rearranging
>> set_up_temporary_text_mapping()
>> to use fewer local variables and the appended patch contains this
>> modification.
>>
>> I also went on and changed relocate_restore_code() slightly in a similar
>> fashion,
>> but all of those changes should not affect the behavior (unless there's
>> something
>> insane going on behind the curtains, that is).
>>
>> Kees, Logan, Boris, please try this one and let me know if it works for
>> you.
Tested-by: Kees Cook <keescook@chromium.org>
I've done a few KASLR boots, and everything continues to look good to
me. Thanks!
-Kees
>> [...]
--
Kees Cook
Chrome OS & Brillo Security
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-29 14:48 ` Kees Cook
@ 2016-06-30 1:52 ` Logan Gunthorpe
2016-06-30 2:20 ` Rafael J. Wysocki
0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30 1:52 UTC (permalink / raw)
To: Kees Cook
Cc: Rafael J. Wysocki, Borislav Petkov, Linus Torvalds,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
Hey Raf,
Sorry to report that although the patch works the majority of the time,
I just got a suspicious kernel panic during resume.
It said:
"kernel tried to execute NX protected page - exploit attempt? (uid: 0)"
You can find a photo of the panic here:
http://staff.deltatee.com/~logang/panic.jpg
Logan
On 29/06/16 08:48 AM, Kees Cook wrote:
> On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> Hey,
>>
>> This version is working for me as well. Thanks.
>>
>> Logan
>>
>> On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
>>>
>>> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>>>>
>>>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com>
>>>> wrote:
>>>>>
>>>>> Hey Rafael,
>>>>>
>>>>> This patch appears to be working on my laptop. Thanks.
>>>>
>>>>
>>>> Same for me: resume still works with KASLR in my tests too.
>>>
>>>
>>> Unfortunately, Boris still sees post-resume memory corruption with the
>>> patch
>>> you have tested, but that turns out to be a result of some super-weird
>>> corruption of a pointer on the stack which leads to a store at a wrong
>>> address
>>> (and there's no way it can be related to the rest of the patch).
>>>
>>> We have verified that it can be avoided by rearranging
>>> set_up_temporary_text_mapping()
>>> to use fewer local variables and the appended patch contains this
>>> modification.
>>>
>>> I also went on and changed relocate_restore_code() slightly in a similar
>>> fashion,
>>> but all of those changes should not affect the behavior (unless there's
>>> something
>>> insane going on behind the curtains, that is).
>>>
>>> Kees, Logan, Boris, please try this one and let me know if it works for
>>> you.
>
> Tested-by: Kees Cook <keescook@chromium.org>
>
> I've done a few KASLR boots, and everything continues to look good to
> me. Thanks!
>
> -Kees
>
>>>
>>> Thanks,
>>> Rafael
>>>
>>>
>>> ---
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption
>>> during image restoration
>>>
>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>> and rodata).
>>>
>>> That turns out to be a consequence of a long-standing issue with the
>>> 64-bit image restoration code on x86, which is that the temporary
>>> page tables set up by it to avoid page tables corruption when the
>>> last bits of the image kernel's memory contents are copied into
>>> their original page frames re-use the boot kernel's text mapping,
>>> but that mapping may very well get corrupted just like any other
>>> part of the page tables. Of course, if that happens, the final
>>> jump to the image kernel's entry point will go to nowhere.
>>>
>>> The exact reason why commit ab76f7b4ab23 matters here is that it
>>> sometimes causes a PMD of a large page to be split into PTEs
>>> that are allocated dynamically and get corrupted during image
>>> restoration as described above.
>>>
>>> To fix that issue note that the code copying the last bits of the
>>> image kernel's memory contents to the page frames occupied by them
>>> previously doesn't use the kernel text mapping, because it runs from
>>> a special page covered by the identity mapping set up for that code
>>> from scratch. Hence, the kernel text mapping is only needed before
>>> that code starts to run and then it will only be used just for the
>>> final jump to the image kernel's entry point.
>>>
>>> Accordingly, the temporary page tables set up in swsusp_arch_resume()
>>> on x86-64 need to contain the kernel text mapping too. That mapping
>>> is only going to be used for the final jump to the image kernel, so
>>> it only needs to cover the image kernel's entry point, because the
>>> first thing the image kernel does after getting control back is to
>>> switch over to its own original page tables. Moreover, the virtual
>>> address of the image kernel's entry point in that mapping has to be
>>> the same as the one mapped by the image kernel's page tables.
>>>
>>> With that in mind, modify the x86-64's arch_hibernation_header_save()
>>> and arch_hibernation_header_restore() routines to pass the physical
>>> address of the image kernel's entry point (in addition to its virtual
>>> address) to the boot kernel (a small piece of assembly code involved
>>> in passing the entry point's virtual address to the image kernel is
>>> not necessary any more after that, so drop it). Update RESTORE_MAGIC
>>> too to reflect the image header format change.
>>>
>>> Next, in set_up_temporary_mappings(), use the physical and virtual
>>> addresses of the image kernel's entry point passed in the image
>>> header to set up a minimum kernel text mapping (using memory pages
>>> that won't be overwritten by the image kernel's memory contents) that
>>> will map those addresses to each other as appropriate.
>>>
>>> This makes the concern about the possible corruption of the original
>>> boot kernel text mapping go away and if the minimum kernel text
>>> mapping used for the final jump marks the image kernel's entry point
>>> memory as executable, the jump to it is guaranteed to succeed.
>>>
>>> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
>>> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
>>> Reported-by: Logan Gunthorpe <logang@deltatee.com>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>  arch/x86/power/hibernate_64.c     |   90 ++++++++++++++++++++++++++++++++------
>>> arch/x86/power/hibernate_asm_64.S | 55 ++++++++++-------------
>>> 2 files changed, 102 insertions(+), 43 deletions(-)
>>>
>>> Index: linux-pm/arch/x86/power/hibernate_64.c
>>> ===================================================================
>>> --- linux-pm.orig/arch/x86/power/hibernate_64.c
>>> +++ linux-pm/arch/x86/power/hibernate_64.c
>>> @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
>>> * kernel's text (this value is passed in the image header).
>>> */
>>> unsigned long restore_jump_address __visible;
>>> +unsigned long jump_address_phys;
>>>
>>> /*
>>>  * Value of the cr3 register from before the hibernation (this value is passed
>>> @@ -37,7 +38,43 @@ unsigned long restore_cr3 __visible;
>>>
>>> pgd_t *temp_level4_pgt __visible;
>>>
>>> -void *relocated_restore_code __visible;
>>> +unsigned long relocated_restore_code __visible;
>>> +
>>> +static int set_up_temporary_text_mapping(void)
>>> +{
>>> + pmd_t *pmd;
>>> + pud_t *pud;
>>> +
>>> + /*
>>> + * The new mapping only has to cover the page containing the image
>>> + * kernel's entry point (jump_address_phys), because the switch over to
>>> + * it is carried out by relocated code running from a page allocated
>>> + * specifically for this purpose and covered by the identity mapping, so
>>> + * the temporary kernel text mapping is only needed for the final jump.
>>> + * Moreover, in that mapping the virtual address of the image kernel's
>>> + * entry point must be the same as its virtual address in the image
>>> + * kernel (restore_jump_address), so the image kernel's
>>> + * restore_registers() code doesn't find itself in a different area of
>>> + * the virtual address space after switching over to the original page
>>> + * tables used by the image kernel.
>>> + */
>>> + pud = (pud_t *)get_safe_page(GFP_ATOMIC);
>>> + if (!pud)
>>> + return -ENOMEM;
>>> +
>>> + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
>>> + if (!pmd)
>>> + return -ENOMEM;
>>> +
>>> + set_pmd(pmd + pmd_index(restore_jump_address),
>>> +               __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
>>> + set_pud(pud + pud_index(restore_jump_address),
>>> + __pud(__pa(pmd) | _KERNPG_TABLE));
>>> + set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
>>> + __pgd(__pa(pud) | _KERNPG_TABLE));
>>> +
>>> + return 0;
>>> +}
>>>
>>> static void *alloc_pgt_page(void *context)
>>> {
>>> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
>>> if (!temp_level4_pgt)
>>> return -ENOMEM;
>>>
>>> - /* It is safe to reuse the original kernel mapping */
>>> - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
>>> - init_level4_pgt[pgd_index(__START_KERNEL_map)]);
>>> + /* Prepare a temporary mapping for the kernel text */
>>> + result = set_up_temporary_text_mapping();
>>> + if (result)
>>> + return result;
>>>
>>> /* Set up the direct mapping from scratch */
>>> for (i = 0; i < nr_pfn_mapped; i++) {
>>> @@ -78,19 +116,44 @@ static int set_up_temporary_mappings(voi
>>> return 0;
>>> }
>>>
>>> +static int relocate_restore_code(void)
>>> +{
>>> + pgd_t *pgd;
>>> + pmd_t *pmd;
>>> +
>>> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>> + if (!relocated_restore_code)
>>> + return -ENOMEM;
>>> +
>>> +       memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>> +
>>> + /* Make the page containing the relocated code executable */
>>> +       pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>> + pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
>>> + relocated_restore_code);
>>> + if (pmd_large(*pmd)) {
>>> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>> + } else {
>>> +               pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>> +
>>> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> int swsusp_arch_resume(void)
>>> {
>>> int error;
>>>
>>> /* We have got enough memory and from now on we cannot recover */
>>> - if ((error = set_up_temporary_mappings()))
>>> + error = set_up_temporary_mappings();
>>> + if (error)
>>> return error;
>>>
>>> - relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
>>> - if (!relocated_restore_code)
>>> - return -ENOMEM;
>>> - memcpy(relocated_restore_code, &core_restore_code,
>>> - &restore_registers - &core_restore_code);
>>> + error = relocate_restore_code();
>>> + if (error)
>>> + return error;
>>>
>>> restore_image();
>>> return 0;
>>> @@ -109,11 +172,12 @@ int pfn_is_nosave(unsigned long pfn)
>>>
>>> struct restore_data_record {
>>> unsigned long jump_address;
>>> + unsigned long jump_address_phys;
>>> unsigned long cr3;
>>> unsigned long magic;
>>> };
>>>
>>> -#define RESTORE_MAGIC 0x0123456789ABCDEFUL
>>> +#define RESTORE_MAGIC 0x123456789ABCDEF0UL
>>>
>>> /**
>>>  * arch_hibernation_header_save - populate the architecture specific part
>>> @@ -126,7 +190,8 @@ int arch_hibernation_header_save(void *a
>>>
>>> if (max_size < sizeof(struct restore_data_record))
>>> return -EOVERFLOW;
>>> - rdr->jump_address = restore_jump_address;
>>> + rdr->jump_address = (unsigned long)&restore_registers;
>>> + rdr->jump_address_phys = __pa_symbol(&restore_registers);
>>> rdr->cr3 = restore_cr3;
>>> rdr->magic = RESTORE_MAGIC;
>>> return 0;
>>> @@ -142,6 +207,7 @@ int arch_hibernation_header_restore(void
>>> struct restore_data_record *rdr = addr;
>>>
>>> restore_jump_address = rdr->jump_address;
>>> + jump_address_phys = rdr->jump_address_phys;
>>> restore_cr3 = rdr->cr3;
>>> return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
>>> }
>>> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
>>> ===================================================================
>>> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
>>> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
>>> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
>>> pushfq
>>> popq pt_regs_flags(%rax)
>>>
>>> - /* save the address of restore_registers */
>>> - movq $restore_registers, %rax
>>> - movq %rax, restore_jump_address(%rip)
>>> /* save cr3 */
>>> movq %cr3, %rax
>>> movq %rax, restore_cr3(%rip)
>>> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
>>> ENDPROC(swsusp_arch_suspend)
>>>
>>> ENTRY(restore_image)
>>> - /* switch to temporary page tables */
>>> - movq $__PAGE_OFFSET, %rdx
>>> - movq temp_level4_pgt(%rip), %rax
>>> - subq %rdx, %rax
>>> - movq %rax, %cr3
>>> - /* Flush TLB */
>>> - movq mmu_cr4_features(%rip), %rax
>>> - movq %rax, %rdx
>>> - andq $~(X86_CR4_PGE), %rdx
>>> - movq %rdx, %cr4; # turn off PGE
>>> - movq %cr3, %rcx; # flush TLB
>>> - movq %rcx, %cr3;
>>> - movq %rax, %cr4; # turn PGE back on
>>> -
>>> /* prepare to jump to the image kernel */
>>> - movq restore_jump_address(%rip), %rax
>>> - movq restore_cr3(%rip), %rbx
>>> + movq restore_jump_address(%rip), %r8
>>> + movq restore_cr3(%rip), %r9
>>> +
>>> + /* prepare to switch to temporary page tables */
>>> + movq temp_level4_pgt(%rip), %rax
>>> + movq mmu_cr4_features(%rip), %rbx
>>>
>>> /* prepare to copy image data to their original locations */
>>> movq restore_pblist(%rip), %rdx
>>> +
>>> + /* jump to relocated restore code */
>>> movq relocated_restore_code(%rip), %rcx
>>> jmpq *%rcx
>>>
>>> /* code below has been relocated to a safe page */
>>> ENTRY(core_restore_code)
>>> + /* switch to temporary page tables */
>>> + movq $__PAGE_OFFSET, %rcx
>>> + subq %rcx, %rax
>>> + movq %rax, %cr3
>>> + /* flush TLB */
>>> + movq %rbx, %rcx
>>> + andq $~(X86_CR4_PGE), %rcx
>>> + movq %rcx, %cr4; # turn off PGE
>>> + movq %cr3, %rcx; # flush TLB
>>> + movq %rcx, %cr3;
>>> + movq %rbx, %cr4; # turn PGE back on
>>> .Lloop:
>>> testq %rdx, %rdx
>>> jz .Ldone
>>> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
>>> /* progress to the next pbe */
>>> movq pbe_next(%rdx), %rdx
>>> jmp .Lloop
>>> +
>>> .Ldone:
>>> /* jump to the restore_registers address from the image header */
>>> - jmpq *%rax
>>> - /*
>>> - * NOTE: This assumes that the boot kernel's text mapping covers the
>>> - * image kernel's page containing restore_registers and the address of
>>> - * this page is the same as in the image kernel's text mapping (it
>>> - * should always be true, because the text mapping is linear, starting
>>> - * from 0, and is supposed to cover the entire kernel text for every
>>> - * kernel).
>>> - *
>>> - * code below belongs to the image kernel
>>> - */
>>> + jmpq *%r8
>>>
>>> + /* code below belongs to the image kernel */
>>> + .align PAGE_SIZE
>>> ENTRY(restore_registers)
>>> FRAME_BEGIN
>>> /* go back to the original page tables */
>>> - movq %rbx, %cr3
>>> + movq %r9, %cr3
>>>
>>> /* Flush TLB, including "global" things (vmalloc) */
>>> movq mmu_cr4_features(%rip), %rax
>>>
>>
>
>
>
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 1:52 ` Logan Gunthorpe
@ 2016-06-30 2:20 ` Rafael J. Wysocki
2016-06-30 2:55 ` Rafael J. Wysocki
2016-06-30 9:45 ` Borislav Petkov
0 siblings, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 2:20 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Wednesday, June 29, 2016 07:52:18 PM Logan Gunthorpe wrote:
> Hey Raf,
>
> Sorry to report that although the patch works the majority of the time,
> I just got a suspicious kernel panic during resume.
>
> It said:
>
> "kernel tried to execute NX protected page - exploit attempt? (uid: 0)"
>
> You can find a photo of the panic here:
>
> http://staff.deltatee.com/~logang/panic.jpg
Thanks for the report!
That's not what Boris was seeing at least.
It looks like clearing the NX bit for relocated_restore_code in
relocate_restore_code() didn't work for some reason.
I don't see why it may not work ATM; I need to have a fresh look at that
tomorrow.
I had hoped to be able to fix this bug for 4.7, but it looks like it will
miss the mark after all. Oh well.
Thanks,
Rafael
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 2:20 ` Rafael J. Wysocki
@ 2016-06-30 2:55 ` Rafael J. Wysocki
2016-06-30 3:56 ` Logan Gunthorpe
2016-06-30 9:45 ` Borislav Petkov
1 sibling, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 2:55 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Thursday, June 30, 2016 04:20:43 AM Rafael J. Wysocki wrote:
> On Wednesday, June 29, 2016 07:52:18 PM Logan Gunthorpe wrote:
> > Hey Raf,
> >
> > Sorry to report that although the patch works the majority of the time,
> > I just got a suspicious kernel panic during resume.
> >
> > It said:
> >
> > "kernel tried to execute NX protected page - exploit attempt? (uid: 0)"
> >
> > You can find a photo of the panic here:
> >
> > http://staff.deltatee.com/~logang/panic.jpg
>
> Thanks for the report!
>
> That's not what Boris was seeing at least.
>
> It looks like clearing the NX bit for relocated_restore_code in
> relocate_restore_code() didn't work for some reason.
>
> I don't see why it may not work ATM, I need to have a fresh look at that
> tomorrow.
>
> I had hoped to be able to fix this bug for 4.7, but it looks like it will
> miss the mark after all. Oh well.
The only thing that comes to mind at this point is that TLBs should be flushed
after page table changes, so please apply the appended and let me know
if you see this panic any more with it.
Thanks,
Rafael
---
arch/x86/power/hibernate_64.c | 92 +++++++++++++++++++++++++++++++++-----
arch/x86/power/hibernate_asm_64.S | 55 +++++++++-------------
2 files changed, 104 insertions(+), 43 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
#include <asm/mtrr.h>
#include <asm/sections.h>
#include <asm/suspend.h>
+#include <asm/tlbflush.h>
/* Defined in hibernate_asm_64.S */
extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
* kernel's text (this value is passed in the image header).
*/
unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
pgd_t *temp_level4_pgt __visible;
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+ pmd_t *pmd;
+ pud_t *pud;
+
+ /*
+ * The new mapping only has to cover the page containing the image
+ * kernel's entry point (jump_address_phys), because the switch over to
+ * it is carried out by relocated code running from a page allocated
+ * specifically for this purpose and covered by the identity mapping, so
+ * the temporary kernel text mapping is only needed for the final jump.
+ * Moreover, in that mapping the virtual address of the image kernel's
+ * entry point must be the same as its virtual address in the image
+ * kernel (restore_jump_address), so the image kernel's
+ * restore_registers() code doesn't find itself in a different area of
+ * the virtual address space after switching over to the original page
+ * tables used by the image kernel.
+ */
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pmd(pmd + pmd_index(restore_jump_address),
+ __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+ set_pud(pud + pud_index(restore_jump_address),
+ __pud(__pa(pmd) | _KERNPG_TABLE));
+ set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+
+ return 0;
+}
static void *alloc_pgt_page(void *context)
{
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;
- /* It is safe to reuse the original kernel mapping */
- set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
- init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ /* Prepare a temporary mapping for the kernel text */
+ result = set_up_temporary_text_mapping();
+ if (result)
+ return result;
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,45 @@ static int set_up_temporary_mappings(voi
return 0;
}
+static int relocate_restore_code(void)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+
+ relocated_restore_code = get_safe_page(GFP_ATOMIC);
+ if (!relocated_restore_code)
+ return -ENOMEM;
+
+ memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+ /* Make the page containing the relocated code executable */
+ pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+ pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
+ relocated_restore_code);
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ } else {
+ pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+ }
+ flush_tlb_all();
+
+ return 0;
+}
+
int swsusp_arch_resume(void)
{
int error;
/* We have got enough memory and from now on we cannot recover */
- if ((error = set_up_temporary_mappings()))
+ error = set_up_temporary_mappings();
+ if (error)
return error;
- relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
- if (!relocated_restore_code)
- return -ENOMEM;
- memcpy(relocated_restore_code, &core_restore_code,
- &restore_registers - &core_restore_code);
+ error = relocate_restore_code();
+ if (error)
+ return error;
restore_image();
return 0;
@@ -109,11 +174,12 @@ int pfn_is_nosave(unsigned long pfn)
struct restore_data_record {
unsigned long jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +192,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = (unsigned long)&restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +209,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
pushfq
popq pt_regs_flags(%rax)
- /* save the address of restore_registers */
- movq $restore_registers, %rax
- movq %rax, restore_jump_address(%rip)
/* save cr3 */
movq %cr3, %rax
movq %rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
ENDPROC(swsusp_arch_suspend)
ENTRY(restore_image)
- /* switch to temporary page tables */
- movq $__PAGE_OFFSET, %rdx
- movq temp_level4_pgt(%rip), %rax
- subq %rdx, %rax
- movq %rax, %cr3
- /* Flush TLB */
- movq mmu_cr4_features(%rip), %rax
- movq %rax, %rdx
- andq $~(X86_CR4_PGE), %rdx
- movq %rdx, %cr4; # turn off PGE
- movq %cr3, %rcx; # flush TLB
- movq %rcx, %cr3;
- movq %rax, %cr4; # turn PGE back on
-
/* prepare to jump to the image kernel */
- movq restore_jump_address(%rip), %rax
- movq restore_cr3(%rip), %rbx
+ movq restore_jump_address(%rip), %r8
+ movq restore_cr3(%rip), %r9
+
+ /* prepare to switch to temporary page tables */
+ movq temp_level4_pgt(%rip), %rax
+ movq mmu_cr4_features(%rip), %rbx
/* prepare to copy image data to their original locations */
movq restore_pblist(%rip), %rdx
+
+ /* jump to relocated restore code */
movq relocated_restore_code(%rip), %rcx
jmpq *%rcx
/* code below has been relocated to a safe page */
ENTRY(core_restore_code)
+ /* switch to temporary page tables */
+ movq $__PAGE_OFFSET, %rcx
+ subq %rcx, %rax
+ movq %rax, %cr3
+ /* flush TLB */
+ movq %rbx, %rcx
+ andq $~(X86_CR4_PGE), %rcx
+ movq %rcx, %cr4; # turn off PGE
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;
+ movq %rbx, %cr4; # turn PGE back on
.Lloop:
testq %rdx, %rdx
jz .Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
/* progress to the next pbe */
movq pbe_next(%rdx), %rdx
jmp .Lloop
+
.Ldone:
/* jump to the restore_registers address from the image header */
- jmpq *%rax
- /*
- * NOTE: This assumes that the boot kernel's text mapping covers the
- * image kernel's page containing restore_registers and the address of
- * this page is the same as in the image kernel's text mapping (it
- * should always be true, because the text mapping is linear, starting
- * from 0, and is supposed to cover the entire kernel text for every
- * kernel).
- *
- * code below belongs to the image kernel
- */
+ jmpq *%r8
+ /* code below belongs to the image kernel */
+ .align PAGE_SIZE
ENTRY(restore_registers)
FRAME_BEGIN
/* go back to the original page tables */
- movq %rbx, %cr3
+ movq %r9, %cr3
/* Flush TLB, including "global" things (vmalloc) */
movq mmu_cr4_features(%rip), %rax
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 2:55 ` Rafael J. Wysocki
@ 2016-06-30 3:56 ` Logan Gunthorpe
2016-06-30 12:16 ` Rafael J. Wysocki
0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30 3:56 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On 29/06/16 08:55 PM, Rafael J. Wysocki wrote:
> The only thing that comes to mind at this point is that TLBs should be flushed
> after page tables changes, so please apply the appended and let me know
> if you see this panic any more with it.
Ok, I'll build a new kernel tomorrow. But keep in mind the panic is
pretty rare as I've only seen it once so far after a couple dozen or so
hibernates. So it may be hard to get a concrete yes or no on whether it
fixes the issue.
I've got a script to run a bunch of hibernates in a row. I usually only
run it for a handful of iterations, but I'll try running it for much
longer with this patch and let you know in a couple days.
Logan
> Thanks,
> Rafael
>
>
> ---
> arch/x86/power/hibernate_64.c | 92 +++++++++++++++++++++++++++++++++-----
> arch/x86/power/hibernate_asm_64.S | 55 +++++++++-------------
> 2 files changed, 104 insertions(+), 43 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -19,6 +19,7 @@
> #include <asm/mtrr.h>
> #include <asm/sections.h>
> #include <asm/suspend.h>
> +#include <asm/tlbflush.h>
>
> /* Defined in hibernate_asm_64.S */
> extern asmlinkage __visible int restore_image(void);
> @@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
> * kernel's text (this value is passed in the image header).
> */
> unsigned long restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
> /*
> * Value of the cr3 register from before the hibernation (this value is passed
> @@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
>
> pgd_t *temp_level4_pgt __visible;
>
> -void *relocated_restore_code __visible;
> +unsigned long relocated_restore_code __visible;
> +
> +static int set_up_temporary_text_mapping(void)
> +{
> + pmd_t *pmd;
> + pud_t *pud;
> +
> + /*
> + * The new mapping only has to cover the page containing the image
> + * kernel's entry point (jump_address_phys), because the switch over to
> + * it is carried out by relocated code running from a page allocated
> + * specifically for this purpose and covered by the identity mapping, so
> + * the temporary kernel text mapping is only needed for the final jump.
> + * Moreover, in that mapping the virtual address of the image kernel's
> + * entry point must be the same as its virtual address in the image
> + * kernel (restore_jump_address), so the image kernel's
> + * restore_registers() code doesn't find itself in a different area of
> + * the virtual address space after switching over to the original page
> + * tables used by the image kernel.
> + */
> + pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> + if (!pud)
> + return -ENOMEM;
> +
> + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> + if (!pmd)
> + return -ENOMEM;
> +
> + set_pmd(pmd + pmd_index(restore_jump_address),
> + __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
> + set_pud(pud + pud_index(restore_jump_address),
> + __pud(__pa(pmd) | _KERNPG_TABLE));
> + set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
> + __pgd(__pa(pud) | _KERNPG_TABLE));
> +
> + return 0;
> +}
>
> static void *alloc_pgt_page(void *context)
> {
> @@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
> if (!temp_level4_pgt)
> return -ENOMEM;
>
> - /* It is safe to reuse the original kernel mapping */
> - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> - init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> + /* Prepare a temporary mapping for the kernel text */
> + result = set_up_temporary_text_mapping();
> + if (result)
> + return result;
>
> /* Set up the direct mapping from scratch */
> for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +117,45 @@ static int set_up_temporary_mappings(voi
> return 0;
> }
>
> +static int relocate_restore_code(void)
> +{
> + pgd_t *pgd;
> + pmd_t *pmd;
> +
> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
> + if (!relocated_restore_code)
> + return -ENOMEM;
> +
> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> + /* Make the page containing the relocated code executable */
> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> + pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
> + relocated_restore_code);
> + if (pmd_large(*pmd)) {
> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> + } else {
> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> + }
> + flush_tlb_all();
> +
> + return 0;
> +}
> +
> int swsusp_arch_resume(void)
> {
> int error;
>
> /* We have got enough memory and from now on we cannot recover */
> - if ((error = set_up_temporary_mappings()))
> + error = set_up_temporary_mappings();
> + if (error)
> return error;
>
> - relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> - if (!relocated_restore_code)
> - return -ENOMEM;
> - memcpy(relocated_restore_code, &core_restore_code,
> - &restore_registers - &core_restore_code);
> + error = relocate_restore_code();
> + if (error)
> + return error;
>
> restore_image();
> return 0;
> @@ -109,11 +174,12 @@ int pfn_is_nosave(unsigned long pfn)
>
> struct restore_data_record {
> unsigned long jump_address;
> + unsigned long jump_address_phys;
> unsigned long cr3;
> unsigned long magic;
> };
>
> -#define RESTORE_MAGIC 0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC 0x123456789ABCDEF0UL
>
> /**
> * arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +192,8 @@ int arch_hibernation_header_save(void *a
>
> if (max_size < sizeof(struct restore_data_record))
> return -EOVERFLOW;
> - rdr->jump_address = restore_jump_address;
> + rdr->jump_address = (unsigned long)&restore_registers;
> + rdr->jump_address_phys = __pa_symbol(&restore_registers);
> rdr->cr3 = restore_cr3;
> rdr->magic = RESTORE_MAGIC;
> return 0;
> @@ -142,6 +209,7 @@ int arch_hibernation_header_restore(void
> struct restore_data_record *rdr = addr;
>
> restore_jump_address = rdr->jump_address;
> + jump_address_phys = rdr->jump_address_phys;
> restore_cr3 = rdr->cr3;
> return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
> }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
> pushfq
> popq pt_regs_flags(%rax)
>
> - /* save the address of restore_registers */
> - movq $restore_registers, %rax
> - movq %rax, restore_jump_address(%rip)
> /* save cr3 */
> movq %cr3, %rax
> movq %rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
> ENDPROC(swsusp_arch_suspend)
>
> ENTRY(restore_image)
> - /* switch to temporary page tables */
> - movq $__PAGE_OFFSET, %rdx
> - movq temp_level4_pgt(%rip), %rax
> - subq %rdx, %rax
> - movq %rax, %cr3
> - /* Flush TLB */
> - movq mmu_cr4_features(%rip), %rax
> - movq %rax, %rdx
> - andq $~(X86_CR4_PGE), %rdx
> - movq %rdx, %cr4; # turn off PGE
> - movq %cr3, %rcx; # flush TLB
> - movq %rcx, %cr3;
> - movq %rax, %cr4; # turn PGE back on
> -
> /* prepare to jump to the image kernel */
> - movq restore_jump_address(%rip), %rax
> - movq restore_cr3(%rip), %rbx
> + movq restore_jump_address(%rip), %r8
> + movq restore_cr3(%rip), %r9
> +
> + /* prepare to switch to temporary page tables */
> + movq temp_level4_pgt(%rip), %rax
> + movq mmu_cr4_features(%rip), %rbx
>
> /* prepare to copy image data to their original locations */
> movq restore_pblist(%rip), %rdx
> +
> + /* jump to relocated restore code */
> movq relocated_restore_code(%rip), %rcx
> jmpq *%rcx
>
> /* code below has been relocated to a safe page */
> ENTRY(core_restore_code)
> + /* switch to temporary page tables */
> + movq $__PAGE_OFFSET, %rcx
> + subq %rcx, %rax
> + movq %rax, %cr3
> + /* flush TLB */
> + movq %rbx, %rcx
> + andq $~(X86_CR4_PGE), %rcx
> + movq %rcx, %cr4; # turn off PGE
> + movq %cr3, %rcx; # flush TLB
> + movq %rcx, %cr3;
> + movq %rbx, %cr4; # turn PGE back on
> .Lloop:
> testq %rdx, %rdx
> jz .Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
> /* progress to the next pbe */
> movq pbe_next(%rdx), %rdx
> jmp .Lloop
> +
> .Ldone:
> /* jump to the restore_registers address from the image header */
> - jmpq *%rax
> - /*
> - * NOTE: This assumes that the boot kernel's text mapping covers the
> - * image kernel's page containing restore_registers and the address of
> - * this page is the same as in the image kernel's text mapping (it
> - * should always be true, because the text mapping is linear, starting
> - * from 0, and is supposed to cover the entire kernel text for every
> - * kernel).
> - *
> - * code below belongs to the image kernel
> - */
> + jmpq *%r8
>
> + /* code below belongs to the image kernel */
> + .align PAGE_SIZE
> ENTRY(restore_registers)
> FRAME_BEGIN
> /* go back to the original page tables */
> - movq %rbx, %cr3
> + movq %r9, %cr3
>
> /* Flush TLB, including "global" things (vmalloc) */
> movq mmu_cr4_features(%rip), %rax
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 3:56 ` Logan Gunthorpe
@ 2016-06-30 12:16 ` Rafael J. Wysocki
0 siblings, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 12:16 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: Rafael J. Wysocki, Kees Cook, Borislav Petkov, Linus Torvalds,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 5:56 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 29/06/16 08:55 PM, Rafael J. Wysocki wrote:
>
>> The only thing that comes to mind at this point is that TLBs should be
>> flushed
>> after page tables changes, so please apply the appended and let me know
>> if you see this panic any more with it.
>
>
>
> Ok, I'll build a new kernel tomorrow.
Well, please wait for a while.
I'm looking at the panic log right now and the picture is a bit clearer now.
> But keep in mind the panic is pretty
> rare as I've only seen it once so far after a couple dozen or so hibernates.
> So it may be hard to get a concrete yes or no on whether it fixes the issue.
Right.
> I've got a script to run a bunch of hibernates in a row. I usually only run
> it for a handful of iterations, but I'll try running it for much longer with
> this patch and let you know in a couple days.
As I said, please wait a bit, there may be updates in the meantime. :-)
From looking at your panic log, the exception happened in
swsusp_arch_resume(), which probably covers restore_image() too,
because that is likely to be inlined into swsusp_arch_resume().
Next, the address in RIP (a) clearly is a page start and (b) is
relative to the identity mapping, so it most likely is the address
from relocated_restore_code. Moreover, the RCX value is the same
address and the values in the other registers also match exactly the
situation before the jump to relocated_restore_code. Thus the
exception was triggered by that jump beyond doubt.
Now, if you look above the Oops line, it becomes quite clear what
happened. Namely, dump_pagetable() (arch/x86/mm/fault.c, line 524 in
4.6) prints the PGD, the PUD, the PMD and the PTE in that order,
unless the lower levels (PTE, PMD) are not present. In your panic
log, only the PGD and PUD are present, meaning that this is a 1G page
and sure enough it has the NX bit set.
This case was clearly overlooked by relocate_restore_code(), so an
updated patch will follow.
Thanks,
Rafael
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 2:20 ` Rafael J. Wysocki
2016-06-30 2:55 ` Rafael J. Wysocki
@ 2016-06-30 9:45 ` Borislav Petkov
2016-06-30 11:27 ` Rafael J. Wysocki
1 sibling, 1 reply; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30 9:45 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 04:20:43AM +0200, Rafael J. Wysocki wrote:
> That's not what Boris was seeing at least.
Well, I had it a couple of times during testing patches. This is all
from the logs:
[ 65.121109] PM: Basic memory bitmaps freed
[ 65.125991] Restarting tasks ...
[ 65.129342] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 65.129585] done.
[ 65.141314] BUG: unable to handle kernel paging request at ffff88042b957e40
[ 65.141316] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[ 65.141318] PGD 2067067 PUD 206a067 PMD 800000042b8001e3
[ 65.141319] Oops: 0011 [#1] PREEMPT SMP
[ 65.141327] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod amdkfd kvm_amd kvm amd_iommu_v2 irqbypass crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq
[ 65.141328] CPU: 6 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #4
[ 65.141329] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[ 65.141329] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[ 65.141331] RIP: 0010:[<ffff88042b957e40>] [<ffff88042b957e40>] 0xffff88042b957e40
[ 65.141331] RSP: 0018:ffff88042b957e00 EFLAGS: 00010282
[ 65.141332] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[ 65.141333] RDX: 0000000000000001 RSI: ffffffff81063b59 RDI: ffffffff8168898c
[ 65.141333] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[ 65.141334] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88042b954000
[ 65.141334] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[ 65.141335] FS: 00007fad32173800(0000) GS:ffff88043dd80000(0000) knlGS:0000000000000000
[ 65.141336] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 65.141336] CR2: ffff88042b957e40 CR3: 00000004298e6000 CR4: 00000000000406e0
[ 65.141336] Stack:
[ 65.141338] ffff880037b81000 ffff880037b81000 0000000000000000 ffffffff81181e1e
[ 65.141339] ffffff9c00000002 ffff880429e8c600 ffffffff811782bf 0000000000000011
[ 65.141340] 000000000000049c 0000000000000001 0000000000001180 0000000000000000
[ 65.141340] Call Trace:
[ 65.141344] [<ffffffff81181e1e>] ? getname_flags+0x5e/0x1b0
[ 65.141346] [<ffffffff811782bf>] ? cp_new_stat+0x10f/0x120
[ 65.141348] [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0
[ 65.141353] [<ffffffff81185fc7>] ? poll_select_copy_remaining+0xe7/0x130
[ 65.141355] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[ 65.141356] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[ 65.141358] [<ffffffff81688e72>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[ 65.141374] Code: 00 00 00 1e 1e 18 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 c6 e8 29 04 88 ff ff bf 82 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <9c> 04 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00
[ 65.141375] RIP [<ffff88042b957e40>] 0xffff88042b957e40
[ 65.141376] RSP <ffff88042b957e00>
[ 65.141376] CR2: ffff88042b957e40
[ 65.141378] ---[ end trace 5dc71ecf8d888ee6 ]---
[ 65.141509] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 65.141509]
[ 65.149191] Kernel Offset: disabled
[ 65.449314] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
...
[ 381.835297] Restarting tasks ...
[ 381.838620] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 381.838689] done.
[ 381.850763] BUG: unable to handle kernel paging request at ffff88042b957e40
[ 381.850765] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[ 381.850766] PGD 2065067 PUD 2068067 PMD 800000042b8001e3
[ 381.850767] Oops: 0011 [#1] PREEMPT SMP
[ 381.850778] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod amdkfd kvm_amd kvm amd_iommu_v2 radeon irqbypass crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd k10temp fam15h_power acpi_cpufreq
[ 381.850779] CPU: 3 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1
[ 381.850780] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[ 381.850781] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[ 381.850782] RIP: 0010:[<ffff88042b957e40>] [<ffff88042b957e40>] 0xffff88042b957e40
[ 381.850783] RSP: 0018:ffff88042b957e00 EFLAGS: 00010282
[ 381.850783] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[ 381.850784] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[ 381.850784] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[ 381.850785] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[ 381.850785] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[ 381.850786] FS: 00007f1143649800(0000) GS:ffff88043dcc0000(0000) knlGS:0000000000000000
[ 381.850787] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 381.850787] CR2: ffff88042b957e40 CR3: 00000004298af000 CR4: 00000000000406e0
[ 381.850788] Stack:
[ 381.850789] ffff88042b1ed000 ffff88042b1ed000 0000000000000000 ffffffff8117f8ae
[ 381.850790] ffffff9c00000002 ffff88042b09ac00 ffffffff81175d5f 0000000000000011
[ 381.850791] 0000000000001c3d 0000000000000001 0000000000001180 0000000000000000
[ 381.850792] Call Trace:
[ 381.850795] [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
[ 381.850797] [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
[ 381.850799] [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
[ 381.850800] [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
[ 381.850802] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[ 381.850804] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[ 381.850806] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[ 381.850820] Code: 00 00 00 ae f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 ac 09 2b 04 88 ff ff 5f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <3d> 1c 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00
[ 381.850821] RIP [<ffff88042b957e40>] 0xffff88042b957e40
[ 381.850821] RSP <ffff88042b957e00>
[ 381.850821] CR2: ffff88042b957e40
[ 381.850824] ---[ end trace b4f9b4244a59d886 ]---
[ 381.851025] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
...
[ 49.003526] Restarting tasks ...
[ 49.007083] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 49.007237] done.
[ 49.022621] BUG: unable to handle kernel paging request at ffff88042b957e40
[ 49.022624] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[ 49.022627] PGD 2065067 PUD 2068067 PMD 800000042b8001e3
[ 49.022629] Oops: 0011 [#1] PREEMPT SMP
[ 49.022642] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd irqbypass crc32_pclmul amd_iommu_v2 radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd k10temp fam15h_power acpi_cpufreq
[ 49.022645] CPU: 4 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #2
[ 49.022646] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[ 49.022648] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[ 49.022650] RIP: 0010:[<ffff88042b957e40>] [<ffff88042b957e40>] 0xffff88042b957e40
[ 49.022652] RSP: 0018:ffff88042b957e00 EFLAGS: 00010282
[ 49.022653] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[ 49.022654] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[ 49.022655] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[ 49.022657] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[ 49.022658] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[ 49.022660] FS: 00007fe2cd5dc800(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[ 49.022661] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.022662] CR2: ffff88042b957e40 CR3: 0000000429edd000 CR4: 00000000000406e0
[ 49.022663] Stack:
[ 49.022666] ffff88042aca7000 ffff88042aca7000 0000000000000000 ffffffff8117f8ae
[ 49.022668] ffffff9c00000002 ffff880429e6e000 ffffffff81175d5f 0000000000000011
[ 49.022674] 0000000000001c49 0000000000000001 0000000000001180 0000000000000000
[ 49.022675] Call Trace:
[ 49.022680] [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
[ 49.022683] [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
[ 49.022686] [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
[ 49.022689] [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
[ 49.022692] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[ 49.022695] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[ 49.022698] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[ 49.022725] Code: 00 00 00 ae f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 e0 e6 29 04 88 ff ff 5f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <49> 1c 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00
[ 49.022727] RIP [<ffff88042b957e40>] 0xffff88042b957e40
[ 49.022728] RSP <ffff88042b957e00>
[ 49.022729] CR2: ffff88042b957e40
[ 49.022732] ---[ end trace 6694c76b6124dda9 ]---
[ 49.022911] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 49.022911]
[ 49.030807] Kernel Offset: disabled
[ 49.348267] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
...
[ 39.616661] PM: Basic memory bitmaps freed
[ 39.621491] Restarting tasks ...
[ 39.624829] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 39.624908] done.
[ 39.636878] BUG: unable to handle kernel paging request at ffff88042b957e40
[ 39.636880] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[ 39.636882] PGD 2065067 PUD 2068067 PMD 800000042b8001e3
[ 39.636883] Oops: 0011 [#1] PREEMPT SMP
[ 39.636890] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm irqbypass crc32_pclmul amdkfd amd_iommu_v2 radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq
[ 39.636892] CPU: 6 PID: 1 Comm: init Not tainted 4.7.0-rc4+ #1
[ 39.636893] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[ 39.636894] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[ 39.636895] RIP: 0010:[<ffff88042b957e40>] [<ffff88042b957e40>] 0xffff88042b957e40
[ 39.636895] RSP: 0018:ffff88042b957e00 EFLAGS: 00010282
[ 39.636896] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[ 39.636897] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[ 39.636897] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[ 39.636898] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[ 39.636898] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[ 39.636899] FS: 00007f45944a4800(0000) GS:ffff88043dd80000(0000) knlGS:0000000000000000
[ 39.636900] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.636900] CR2: ffff88042b957e40 CR3: 0000000429015000 CR4: 00000000000406e0
[ 39.636901] Stack:
[ 39.636902] ffff8800b9ec5000 ffff8800b9ec5000 0000000000000000 ffffffff8117f8be
[ 39.636903] ffffff9c00000002 ffff88042ae8aa80 ffffffff81175d6f 0000000000000011
[ 39.636904] 000000000000284c 0000000000000001 0000000000001180 0000000000000000
[ 39.636905] Call Trace:
[ 39.636908] [<ffffffff8117f8be>] ? getname_flags+0x5e/0x1b0
[ 39.636910] [<ffffffff81175d6f>] ? cp_new_stat+0x10f/0x120
[ 39.636912] [<ffffffff810b9eaa>] ? ktime_get_ts64+0x4a/0xf0
[ 39.636917] [<ffffffff81183a67>] ? poll_select_copy_remaining+0xe7/0x130
[ 39.636919] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[ 39.636921] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[ 39.636922] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[ 39.636939] Code: 00 00 00 be f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 80 aa e8 2a 04 88 ff ff 6f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <4c> 28 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00
[ 39.636939] RIP [<ffff88042b957e40>] 0xffff88042b957e40
[ 39.636940] RSP <ffff88042b957e00>
[ 39.636940] CR2: ffff88042b957e40
[ 39.636943] ---[ end trace 7b732e7484eb8577 ]---
[ 39.637066] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[ 39.637066]
[ 39.644839] Kernel Offset: disabled
[ 39.944295] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
...
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 9:45 ` Borislav Petkov
@ 2016-06-30 11:27 ` Rafael J. Wysocki
0 siblings, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 11:27 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rafael J. Wysocki, Logan Gunthorpe, Kees Cook, Linus Torvalds,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 11:45 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jun 30, 2016 at 04:20:43AM +0200, Rafael J. Wysocki wrote:
>> That's not what Boris was seeing at least.
>
> Well, I had it a couple of times during testing patches. This is all
> from the logs:
>
> [ 65.121109] PM: Basic memory bitmaps freed
> [ 65.125991] Restarting tasks ...
> [ 65.129342] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [ 65.129585] done.
> [ 65.141314] BUG: unable to handle kernel paging request at ffff88042b957e40
I mean the failure mode, not the particular exception type. :-)
You always saw it in a user space task after kernel resume:
> [ 65.141340] Call Trace:
> [ 65.141344] [<ffffffff81181e1e>] ? getname_flags+0x5e/0x1b0
> [ 65.141346] [<ffffffff811782bf>] ? cp_new_stat+0x10f/0x120
> [ 65.141348] [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0
> [ 65.141353] [<ffffffff81185fc7>] ? poll_select_copy_remaining+0xe7/0x130
> [ 65.141355] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [ 65.141356] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [ 65.141358] [<ffffffff81688e72>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[cut]
> [ 381.850792] Call Trace:
> [ 381.850795] [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
> [ 381.850797] [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
> [ 381.850799] [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
> [ 381.850800] [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
> [ 381.850802] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [ 381.850804] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [ 381.850806] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[cut]
> [ 49.022675] Call Trace:
> [ 49.022680] [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
> [ 49.022683] [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
> [ 49.022686] [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
> [ 49.022689] [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
> [ 49.022692] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [ 49.022695] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [ 49.022698] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[cut]
> [ 39.636905] Call Trace:
> [ 39.636908] [<ffffffff8117f8be>] ? getname_flags+0x5e/0x1b0
> [ 39.636910] [<ffffffff81175d6f>] ? cp_new_stat+0x10f/0x120
> [ 39.636912] [<ffffffff810b9eaa>] ? ktime_get_ts64+0x4a/0xf0
> [ 39.636917] [<ffffffff81183a67>] ? poll_select_copy_remaining+0xe7/0x130
> [ 39.636919] [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [ 39.636921] [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [ 39.636922] [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
which is a clear indication of image corruption during restore.
In the Logan's case this happens in swsusp_arch_resume() proper and
the address in RIP is relative to the identity mapping, so the only
place it can happen is the jump to relocated_restore_code. That's
because before that jump the addresses in RIP are relative to the
kernel text mapping and after it we immediately switch over to the
temporary page tables which are all executable. So that is the only
place AFAICS.
Also in your case the failure was 100% reproducible, while in
Logan's case it has happened once so far (so generally it happens once
in a blue moon).
In summary, I'm sure that this is a different issue.
Thanks,
Rafael
* [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-27 14:24 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
2016-06-27 20:08 ` Borislav Petkov
2016-06-27 23:33 ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
@ 2016-06-30 13:17 ` Rafael J. Wysocki
2016-06-30 15:05 ` Borislav Petkov
2016-06-30 16:11 ` [PATCH v5] " Rafael J. Wysocki
2 siblings, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 13:17 UTC (permalink / raw)
To: Logan Gunthorpe, Borislav Petkov
Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).
That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables. Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.
The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.
To fix that issue note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch. Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.
Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too. That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables. Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.
With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it). Update RESTORE_MAGIC
too to reflect the image header format change.
Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.
This makes the concern about the possible corruption of the original
boot kernel text mapping go away and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.
Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v3 -> v4: The new relocate_restore_code() didn't cover the case when
the page containing the relocated code was mapped via a huge (1G)
page and it didn't clear the NX bit in that case, which led to the
page fault reported by Logan at
http://marc.info/?l=linux-kernel&m=146725158621626&w=4.
Fix that by making relocate_restore_code() handle the huge page
mapping case too.
I've retained the Tested-by: from Kees as this change has nothing to do with
whether or not KASLR works with hibernation after this patch.
Logan, please test this one thoroughly. I'll give it at least one week in
linux-next going forward.
Boris, please test it on the machine where we saw memory corruption with
the previous versions if poss.
Thanks,
Rafael
---
arch/x86/power/hibernate_64.c | 97 +++++++++++++++++++++++++++++++++-----
arch/x86/power/hibernate_asm_64.S | 55 +++++++++------------
2 files changed, 109 insertions(+), 43 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
#include <asm/mtrr.h>
#include <asm/sections.h>
#include <asm/suspend.h>
+#include <asm/tlbflush.h>
/* Defined in hibernate_asm_64.S */
extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
* kernel's text (this value is passed in the image header).
*/
unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
pgd_t *temp_level4_pgt __visible;
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+ pmd_t *pmd;
+ pud_t *pud;
+
+ /*
+ * The new mapping only has to cover the page containing the image
+ * kernel's entry point (jump_address_phys), because the switch over to
+ * it is carried out by relocated code running from a page allocated
+ * specifically for this purpose and covered by the identity mapping, so
+ * the temporary kernel text mapping is only needed for the final jump.
+ * Moreover, in that mapping the virtual address of the image kernel's
+ * entry point must be the same as its virtual address in the image
+ * kernel (restore_jump_address), so the image kernel's
+ * restore_registers() code doesn't find itself in a different area of
+ * the virtual address space after switching over to the original page
+ * tables used by the image kernel.
+ */
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pmd(pmd + pmd_index(restore_jump_address),
+ __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+ set_pud(pud + pud_index(restore_jump_address),
+ __pud(__pa(pmd) | _KERNPG_TABLE));
+ set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+
+ return 0;
+}
static void *alloc_pgt_page(void *context)
{
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;
- /* It is safe to reuse the original kernel mapping */
- set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
- init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ /* Prepare a temporary mapping for the kernel text */
+ result = set_up_temporary_text_mapping();
+ if (result)
+ return result;
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,50 @@ static int set_up_temporary_mappings(voi
return 0;
}
+static int relocate_restore_code(void)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+
+ relocated_restore_code = get_safe_page(GFP_ATOMIC);
+ if (!relocated_restore_code)
+ return -ENOMEM;
+
+ memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+ /* Make the page containing the relocated code executable */
+ pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+ pud = pud_offset(pgd, relocated_restore_code);
+ if (pud_large(*pud)) {
+ set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
+ } else {
+ pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
+
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ } else {
+ pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+ }
+ }
+ flush_tlb_all();
+
+ return 0;
+}
+
int swsusp_arch_resume(void)
{
int error;
/* We have got enough memory and from now on we cannot recover */
- if ((error = set_up_temporary_mappings()))
+ error = set_up_temporary_mappings();
+ if (error)
return error;
- relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
- if (!relocated_restore_code)
- return -ENOMEM;
- memcpy(relocated_restore_code, &core_restore_code,
- &restore_registers - &core_restore_code);
+ error = relocate_restore_code();
+ if (error)
+ return error;
restore_image();
return 0;
@@ -109,11 +179,12 @@ int pfn_is_nosave(unsigned long pfn)
struct restore_data_record {
unsigned long jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +197,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = (unsigned long)&restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +214,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
pushfq
popq pt_regs_flags(%rax)
- /* save the address of restore_registers */
- movq $restore_registers, %rax
- movq %rax, restore_jump_address(%rip)
/* save cr3 */
movq %cr3, %rax
movq %rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
ENDPROC(swsusp_arch_suspend)
ENTRY(restore_image)
- /* switch to temporary page tables */
- movq $__PAGE_OFFSET, %rdx
- movq temp_level4_pgt(%rip), %rax
- subq %rdx, %rax
- movq %rax, %cr3
- /* Flush TLB */
- movq mmu_cr4_features(%rip), %rax
- movq %rax, %rdx
- andq $~(X86_CR4_PGE), %rdx
- movq %rdx, %cr4; # turn off PGE
- movq %cr3, %rcx; # flush TLB
- movq %rcx, %cr3;
- movq %rax, %cr4; # turn PGE back on
-
/* prepare to jump to the image kernel */
- movq restore_jump_address(%rip), %rax
- movq restore_cr3(%rip), %rbx
+ movq restore_jump_address(%rip), %r8
+ movq restore_cr3(%rip), %r9
+
+ /* prepare to switch to temporary page tables */
+ movq temp_level4_pgt(%rip), %rax
+ movq mmu_cr4_features(%rip), %rbx
/* prepare to copy image data to their original locations */
movq restore_pblist(%rip), %rdx
+
+ /* jump to relocated restore code */
movq relocated_restore_code(%rip), %rcx
jmpq *%rcx
/* code below has been relocated to a safe page */
ENTRY(core_restore_code)
+ /* switch to temporary page tables */
+ movq $__PAGE_OFFSET, %rcx
+ subq %rcx, %rax
+ movq %rax, %cr3
+ /* flush TLB */
+ movq %rbx, %rcx
+ andq $~(X86_CR4_PGE), %rcx
+ movq %rcx, %cr4; # turn off PGE
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;
+ movq %rbx, %cr4; # turn PGE back on
.Lloop:
testq %rdx, %rdx
jz .Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
/* progress to the next pbe */
movq pbe_next(%rdx), %rdx
jmp .Lloop
+
.Ldone:
/* jump to the restore_registers address from the image header */
- jmpq *%rax
- /*
- * NOTE: This assumes that the boot kernel's text mapping covers the
- * image kernel's page containing restore_registers and the address of
- * this page is the same as in the image kernel's text mapping (it
- * should always be true, because the text mapping is linear, starting
- * from 0, and is supposed to cover the entire kernel text for every
- * kernel).
- *
- * code below belongs to the image kernel
- */
+ jmpq *%r8
+ /* code below belongs to the image kernel */
+ .align PAGE_SIZE
ENTRY(restore_registers)
FRAME_BEGIN
/* go back to the original page tables */
- movq %rbx, %cr3
+ movq %r9, %cr3
/* Flush TLB, including "global" things (vmalloc) */
movq mmu_cr4_features(%rip), %rax
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 13:17 ` [PATCH v4] " Rafael J. Wysocki
@ 2016-06-30 15:05 ` Borislav Petkov
2016-06-30 15:17 ` Rafael J. Wysocki
2016-06-30 16:11 ` [PATCH v5] " Rafael J. Wysocki
1 sibling, 1 reply; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30 15:05 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
...
> +static int relocate_restore_code(void)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> +
> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
> + if (!relocated_restore_code)
> + return -ENOMEM;
> +
> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> + /* Make the page containing the relocated code executable */
> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> + pud = pud_offset(pgd, relocated_restore_code);
> + if (pud_large(*pud)) {
> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
> + } else {
> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
> +
> + if (pmd_large(*pmd)) {
> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> + } else {
> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> + }
> + }
> + flush_tlb_all();
I know you want to flush TLBs but this causes the splat below on the
resume kernel.
Most likely because:
resume_target_kernel() does local_irq_disable() and then
swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
and smp_call_function_many() doesn't like it when IRQs are disabled.
[ 7.613645] Disabling non-boot CPUs ...
[ 7.902408] ------------[ cut here ]------------
[ 7.907106] WARNING: CPU: 0 PID: 1 at kernel/smp.c:416 smp_call_function_many+0xb6/0x260
[ 7.915319] Modules linked in:
[ 7.918501] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #11
[ 7.924931] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[ 7.934967] 0000000000000000 ffff88042b957cf8 ffffffff812ac1c3 0000000000000000
[ 7.942664] 0000000000000000 ffff88042b957d38 ffffffff8105435d 000001a02b957d28
[ 7.950369] 0000000000000000 0000000000000000 ffffffff8104d420 0000000000000000
[ 7.958072] Call Trace:
[ 7.960598] [<ffffffff812ac1c3>] dump_stack+0x67/0x94
[ 7.965815] [<ffffffff8105435d>] __warn+0xdd/0x100
[ 7.970771] [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[ 7.975981] [<ffffffff8105444d>] warn_slowpath_null+0x1d/0x20
[ 7.981891] [<ffffffff810cb526>] smp_call_function_many+0xb6/0x260
[ 7.988236] [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[ 7.993452] [<ffffffff810cb716>] smp_call_function+0x46/0x80
[ 7.999277] [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[ 8.004494] [<ffffffff810cb78e>] on_each_cpu+0x3e/0xa0
[ 8.009790] [<ffffffff81098e00>] ? hibernation_restore+0x130/0x130
[ 8.016135] [<ffffffff8104debc>] flush_tlb_all+0x1c/0x20
[ 8.021613] [<ffffffff815bd8d4>] swsusp_arch_resume+0x254/0x2b0
[ 8.027696] [<ffffffff815bd660>] ? restore_processor_state+0x2f0/0x2f0
[ 8.034387] [<ffffffff81098d9d>] hibernation_restore+0xcd/0x130
[ 8.040464] [<ffffffff81112fbd>] software_resume.part.6+0x1f9/0x25b
[ 8.046894] [<ffffffff81098e26>] software_resume+0x26/0x30
[ 8.052545] [<ffffffff81000449>] do_one_initcall+0x59/0x190
[ 8.058282] [<ffffffff81071b3c>] ? parse_args+0x26c/0x3f0
[ 8.063867] [<ffffffff8168b000>] ? _raw_read_unlock_irqrestore+0x30/0x60
[ 8.070730] [<ffffffff81cd5002>] kernel_init_freeable+0x118/0x19e
[ 8.076986] [<ffffffff816851ae>] kernel_init+0xe/0x100
[ 8.082290] [<ffffffff8168b75f>] ret_from_fork+0x1f/0x40
[ 8.087768] [<ffffffff816851a0>] ? rest_init+0x90/0x90
[ 8.093073] ---[ end trace 6361ce069253f25c ]---
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 15:05 ` Borislav Petkov
@ 2016-06-30 15:17 ` Rafael J. Wysocki
2016-06-30 15:24 ` Andy Lutomirski
0 siblings, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 15:17 UTC (permalink / raw)
To: Borislav Petkov
Cc: Rafael J. Wysocki, Logan Gunthorpe, Kees Cook, Linus Torvalds,
Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Logan Gunthorpe reports that hibernation stopped working reliably for
>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>> and rodata).
>
> ...
>
>> +static int relocate_restore_code(void)
>> +{
>> + pgd_t *pgd;
>> + pud_t *pud;
>> +
>> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
>> + if (!relocated_restore_code)
>> + return -ENOMEM;
>> +
>> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>> +
>> + /* Make the page containing the relocated code executable */
>> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>> + pud = pud_offset(pgd, relocated_restore_code);
>> + if (pud_large(*pud)) {
>> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>> + } else {
>> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>> +
>> + if (pmd_large(*pmd)) {
>> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>> + } else {
>> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>> +
>> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>> + }
>> + }
>> + flush_tlb_all();
>
> I know you want to flush TLBs but this causes the splat below on the
> resume kernel.
>
> Most likely because:
>
> resume_target_kernel() does local_irq_disable() and then
>
> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>
> and smp_call_function_many() doesn't like it when IRQs are disabled.
Right.
Can I invoke __flush_tlb_all() from here to flush the TLB on the local
CPU? That should be sufficient IMO.
Thanks,
Rafael
* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 15:17 ` Rafael J. Wysocki
@ 2016-06-30 15:24 ` Andy Lutomirski
2016-06-30 15:29 ` Rafael J. Wysocki
0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2016-06-30 15:24 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Borislav Petkov, Rafael J. Wysocki, Logan Gunthorpe, Kees Cook,
Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>> and rodata).
>>
>> ...
>>
>>> +static int relocate_restore_code(void)
>>> +{
>>> + pgd_t *pgd;
>>> + pud_t *pud;
>>> +
>>> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>> + if (!relocated_restore_code)
>>> + return -ENOMEM;
>>> +
>>> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>> +
>>> + /* Make the page containing the relocated code executable */
>>> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>> + pud = pud_offset(pgd, relocated_restore_code);
>>> + if (pud_large(*pud)) {
>>> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>> + } else {
>>> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>> +
>>> + if (pmd_large(*pmd)) {
>>> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>> + } else {
>>> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>> +
>>> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>> + }
>>> + }
>>> + flush_tlb_all();
>>
>> I know you want to flush TLBs but this causes the splat below on the
>> resume kernel.
>>
>> Most likely because:
>>
>> resume_target_kernel() does local_irq_disable() and then
>>
>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>
>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>
> Right.
>
> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
> CPU? That should be sufficient IMO.
Yes, unless there's another CPU already up at the time, but that seems unlikely.
FWIW, the CPU can and will cache the NX bit and the CPU will *not*
refresh the TLB before sending a page fault. You definitely need to
flush. I wonder if this is also why you're seeing a certain amount of
random corruption.
--Andy
* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 15:24 ` Andy Lutomirski
@ 2016-06-30 15:29 ` Rafael J. Wysocki
2016-06-30 17:23 ` Andy Lutomirski
0 siblings, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 15:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rafael J. Wysocki, Borislav Petkov, Rafael J. Wysocki,
Logan Gunthorpe, Kees Cook, Linus Torvalds, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>>> and rodata).
>>>
>>> ...
>>>
>>>> +static int relocate_restore_code(void)
>>>> +{
>>>> + pgd_t *pgd;
>>>> + pud_t *pud;
>>>> +
>>>> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>>> + if (!relocated_restore_code)
>>>> + return -ENOMEM;
>>>> +
>>>> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>>> +
>>>> + /* Make the page containing the relocated code executable */
>>>> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>>> + pud = pud_offset(pgd, relocated_restore_code);
>>>> + if (pud_large(*pud)) {
>>>> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>>> + } else {
>>>> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>>> +
>>>> + if (pmd_large(*pmd)) {
>>>> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>>> + } else {
>>>> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>>> +
>>>> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>>> + }
>>>> + }
>>>> + flush_tlb_all();
>>>
>>> I know you want to flush TLBs but this causes the splat below on the
>>> resume kernel.
>>>
>>> Most likely because:
>>>
>>> resume_target_kernel() does local_irq_disable() and then
>>>
>>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>>
>>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>>
>> Right.
>>
>> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
>> CPU? That should be sufficient IMO.
>
> Yes, unless there's another CPU already up at the time, but that seems unlikely.
They are all offline at that time.
OK, let me do that then.
> FWIW, the CPU can and will cache the NX bit and the CPU will *not*
> refresh the TLB before sending a page fault. You definitely need to
> flush.
OK
> I wonder if this is also why you're seeing a certain amount of
> random corruption.
Well, the corruption we saw was not really random. I can give you all
of the details if you're interested, maybe you'll be able to figure
out what might be happening. :-)
Thanks,
Rafael
* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 15:29 ` Rafael J. Wysocki
@ 2016-06-30 17:23 ` Andy Lutomirski
0 siblings, 0 replies; 39+ messages in thread
From: Andy Lutomirski @ 2016-06-30 17:23 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Borislav Petkov, Rafael J. Wysocki, Logan Gunthorpe, Kees Cook,
Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 8:29 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Jun 30, 2016 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>>>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>
>>>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>>>> and rodata).
>>>>
>>>> ...
>>>>
>>>>> +static int relocate_restore_code(void)
>>>>> +{
>>>>> + pgd_t *pgd;
>>>>> + pud_t *pud;
>>>>> +
>>>>> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>>>> + if (!relocated_restore_code)
>>>>> + return -ENOMEM;
>>>>> +
>>>>> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>>>> +
>>>>> + /* Make the page containing the relocated code executable */
>>>>> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>>>> + pud = pud_offset(pgd, relocated_restore_code);
>>>>> + if (pud_large(*pud)) {
>>>>> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>>>> + } else {
>>>>> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>>>> +
>>>>> + if (pmd_large(*pmd)) {
>>>>> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>>>> + } else {
>>>>> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>>>> +
>>>>> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>>>> + }
>>>>> + }
>>>>> + flush_tlb_all();
>>>>
>>>> I know you want to flush TLBs but this causes the splat below on the
>>>> resume kernel.
>>>>
>>>> Most likely because:
>>>>
>>>> resume_target_kernel() does local_irq_disable() and then
>>>>
>>>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>>>
>>>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>>>
>>> Right.
>>>
>>> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
>>> CPU? That should be sufficient IMO.
>>
>> Yes, unless there's another CPU already up at the time, but that seems unlikely.
>
> They are all offline at that time.
>
> OK, let me do that then.
>
>> FWIW, the CPU can and will cache the NX bit and the CPU will *not*
>> refresh the TLB before sending a page fault. You definitely need to
>> flush.
>
> OK
>
>> I wonder if this is also why you're seeing a certain amount of
>> random corruption.
>
> Well, the corruption we saw was not really random. I can give you all
> of the details if you're interested, maybe you'll be able to figure
> out what might be happening. :-)
Sure, or at least in abbreviated form. No guarantee that I'll have
any clue, though.
--Andy
* [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 13:17 ` [PATCH v4] " Rafael J. Wysocki
2016-06-30 15:05 ` Borislav Petkov
@ 2016-06-30 16:11 ` Rafael J. Wysocki
2016-06-30 17:02 ` Borislav Petkov
2016-06-30 21:47 ` Logan Gunthorpe
1 sibling, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 16:11 UTC (permalink / raw)
To: Logan Gunthorpe, Borislav Petkov
Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).
That turns out to be a consequence of a long-standing issue in the
64-bit image restoration code on x86: the temporary page tables it
sets up to avoid corrupting the page tables while the last bits of
the image kernel's memory contents are copied into their original
page frames re-use the boot kernel's text mapping, but that mapping
may very well get corrupted just like any other part of the page
tables. Of course, if that happens, the final jump to the image
kernel's entry point goes nowhere.
The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.
To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames previously occupied
by them doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch. Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.
Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too. That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables. Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.
With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it). Update RESTORE_MAGIC
too to reflect the image header format change.
Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.
This makes the concern about possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.
Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
v4 -> v5: Replace flush_tlb_all() in relocate_restore_code() with
__flush_tlb_all() (which still should be sufficient) to avoid
complaints from smp_call_function_many() about disabled interrupts.
v3 -> v4: The new relocate_restore_code() didn't cover the case when
the page containing the relocated code was mapped via a huge (1G)
page and it didn't clear the NX bit in that case, which led to the
page fault reported by Logan at
http://marc.info/?l=linux-kernel&m=146725158621626&w=4.
Fix that by making relocate_restore_code() handle the huge page
mapping case too.
I've retained the Tested-by: from Kees as these changes have nothing to do with
whether or not KASLR works with hibernation after this patch.
Logan, please test this one thoroughly. I'll give it at least one week in
linux-next going forward.
Boris, please test it on the machine where we saw memory corruption with
the previous versions if poss.
---
arch/x86/power/hibernate_64.c | 97 +++++++++++++++++++++++++++++++++-----
arch/x86/power/hibernate_asm_64.S | 55 +++++++++------------
2 files changed, 109 insertions(+), 43 deletions(-)
Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
#include <asm/mtrr.h>
#include <asm/sections.h>
#include <asm/suspend.h>
+#include <asm/tlbflush.h>
/* Defined in hibernate_asm_64.S */
extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
* kernel's text (this value is passed in the image header).
*/
unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
/*
* Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
pgd_t *temp_level4_pgt __visible;
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+ pmd_t *pmd;
+ pud_t *pud;
+
+ /*
+ * The new mapping only has to cover the page containing the image
+ * kernel's entry point (jump_address_phys), because the switch over to
+ * it is carried out by relocated code running from a page allocated
+ * specifically for this purpose and covered by the identity mapping, so
+ * the temporary kernel text mapping is only needed for the final jump.
+ * Moreover, in that mapping the virtual address of the image kernel's
+ * entry point must be the same as its virtual address in the image
+ * kernel (restore_jump_address), so the image kernel's
+ * restore_registers() code doesn't find itself in a different area of
+ * the virtual address space after switching over to the original page
+ * tables used by the image kernel.
+ */
+ pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+ if (!pud)
+ return -ENOMEM;
+
+ pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+ if (!pmd)
+ return -ENOMEM;
+
+ set_pmd(pmd + pmd_index(restore_jump_address),
+ __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+ set_pud(pud + pud_index(restore_jump_address),
+ __pud(__pa(pmd) | _KERNPG_TABLE));
+ set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+ __pgd(__pa(pud) | _KERNPG_TABLE));
+
+ return 0;
+}
static void *alloc_pgt_page(void *context)
{
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
if (!temp_level4_pgt)
return -ENOMEM;
- /* It is safe to reuse the original kernel mapping */
- set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
- init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+ /* Prepare a temporary mapping for the kernel text */
+ result = set_up_temporary_text_mapping();
+ if (result)
+ return result;
/* Set up the direct mapping from scratch */
for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,50 @@ static int set_up_temporary_mappings(voi
return 0;
}
+static int relocate_restore_code(void)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+
+ relocated_restore_code = get_safe_page(GFP_ATOMIC);
+ if (!relocated_restore_code)
+ return -ENOMEM;
+
+ memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+ /* Make the page containing the relocated code executable */
+ pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+ pud = pud_offset(pgd, relocated_restore_code);
+ if (pud_large(*pud)) {
+ set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
+ } else {
+ pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
+
+ if (pmd_large(*pmd)) {
+ set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+ } else {
+ pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+ set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+ }
+ }
+ __flush_tlb_all();
+
+ return 0;
+}
+
int swsusp_arch_resume(void)
{
int error;
/* We have got enough memory and from now on we cannot recover */
- if ((error = set_up_temporary_mappings()))
+ error = set_up_temporary_mappings();
+ if (error)
return error;
- relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
- if (!relocated_restore_code)
- return -ENOMEM;
- memcpy(relocated_restore_code, &core_restore_code,
- &restore_registers - &core_restore_code);
+ error = relocate_restore_code();
+ if (error)
+ return error;
restore_image();
return 0;
@@ -109,11 +179,12 @@ int pfn_is_nosave(unsigned long pfn)
struct restore_data_record {
unsigned long jump_address;
+ unsigned long jump_address_phys;
unsigned long cr3;
unsigned long magic;
};
-#define RESTORE_MAGIC 0x0123456789ABCDEFUL
+#define RESTORE_MAGIC 0x123456789ABCDEF0UL
/**
* arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +197,8 @@ int arch_hibernation_header_save(void *a
if (max_size < sizeof(struct restore_data_record))
return -EOVERFLOW;
- rdr->jump_address = restore_jump_address;
+ rdr->jump_address = (unsigned long)&restore_registers;
+ rdr->jump_address_phys = __pa_symbol(&restore_registers);
rdr->cr3 = restore_cr3;
rdr->magic = RESTORE_MAGIC;
return 0;
@@ -142,6 +214,7 @@ int arch_hibernation_header_restore(void
struct restore_data_record *rdr = addr;
restore_jump_address = rdr->jump_address;
+ jump_address_phys = rdr->jump_address_phys;
restore_cr3 = rdr->cr3;
return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
}
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
pushfq
popq pt_regs_flags(%rax)
- /* save the address of restore_registers */
- movq $restore_registers, %rax
- movq %rax, restore_jump_address(%rip)
/* save cr3 */
movq %cr3, %rax
movq %rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
ENDPROC(swsusp_arch_suspend)
ENTRY(restore_image)
- /* switch to temporary page tables */
- movq $__PAGE_OFFSET, %rdx
- movq temp_level4_pgt(%rip), %rax
- subq %rdx, %rax
- movq %rax, %cr3
- /* Flush TLB */
- movq mmu_cr4_features(%rip), %rax
- movq %rax, %rdx
- andq $~(X86_CR4_PGE), %rdx
- movq %rdx, %cr4; # turn off PGE
- movq %cr3, %rcx; # flush TLB
- movq %rcx, %cr3;
- movq %rax, %cr4; # turn PGE back on
-
/* prepare to jump to the image kernel */
- movq restore_jump_address(%rip), %rax
- movq restore_cr3(%rip), %rbx
+ movq restore_jump_address(%rip), %r8
+ movq restore_cr3(%rip), %r9
+
+ /* prepare to switch to temporary page tables */
+ movq temp_level4_pgt(%rip), %rax
+ movq mmu_cr4_features(%rip), %rbx
/* prepare to copy image data to their original locations */
movq restore_pblist(%rip), %rdx
+
+ /* jump to relocated restore code */
movq relocated_restore_code(%rip), %rcx
jmpq *%rcx
/* code below has been relocated to a safe page */
ENTRY(core_restore_code)
+ /* switch to temporary page tables */
+ movq $__PAGE_OFFSET, %rcx
+ subq %rcx, %rax
+ movq %rax, %cr3
+ /* flush TLB */
+ movq %rbx, %rcx
+ andq $~(X86_CR4_PGE), %rcx
+ movq %rcx, %cr4; # turn off PGE
+ movq %cr3, %rcx; # flush TLB
+ movq %rcx, %cr3;
+ movq %rbx, %cr4; # turn PGE back on
.Lloop:
testq %rdx, %rdx
jz .Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
/* progress to the next pbe */
movq pbe_next(%rdx), %rdx
jmp .Lloop
+
.Ldone:
/* jump to the restore_registers address from the image header */
- jmpq *%rax
- /*
- * NOTE: This assumes that the boot kernel's text mapping covers the
- * image kernel's page containing restore_registers and the address of
- * this page is the same as in the image kernel's text mapping (it
- * should always be true, because the text mapping is linear, starting
- * from 0, and is supposed to cover the entire kernel text for every
- * kernel).
- *
- * code below belongs to the image kernel
- */
+ jmpq *%r8
+ /* code below belongs to the image kernel */
+ .align PAGE_SIZE
ENTRY(restore_registers)
FRAME_BEGIN
/* go back to the original page tables */
- movq %rbx, %cr3
+ movq %r9, %cr3
/* Flush TLB, including "global" things (vmalloc) */
movq mmu_cr4_features(%rip), %rax
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 16:11 ` [PATCH v5] " Rafael J. Wysocki
@ 2016-06-30 17:02 ` Borislav Petkov
2016-06-30 21:47 ` Logan Gunthorpe
1 sibling, 0 replies; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30 17:02 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
H. Peter Anvin, Linux PM list, Stephen Smalley
On Thu, Jun 30, 2016 at 06:11:41PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables. Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.
...
> Boris, please test it on the machine where we saw memory corruption with
> the previous versions if poss.
Looks good. 5 runs passed without a hiccup.
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
2016-06-30 16:11 ` [PATCH v5] " Rafael J. Wysocki
2016-06-30 17:02 ` Borislav Petkov
@ 2016-06-30 21:47 ` Logan Gunthorpe
1 sibling, 0 replies; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30 21:47 UTC (permalink / raw)
To: Rafael J. Wysocki, Borislav Petkov
Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
Linux PM list, Stephen Smalley
Hey,
So, I've done a couple hundred hibernate resume cycles with v5 of the
patch and so far it looks good. I'll let you know if I hit any bugs in
more typical usage over the coming week.
Thanks,
Logan
On 30/06/16 10:11 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables. Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue note that the code copying the last bits of the
> image kernel's memory contents to the page frames occupied by them
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch. Hence, the kernel text mapping is only needed before
> that code starts to run and then it will only be used just for the
> final jump to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too. That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables. Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it). Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Tested-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> v4 -> v5: Replace flush_tlb_all() in relocate_restore_code() with
> __flush_tlb_all() (which still should be sufficient) to avoid
> complaints from smp_call_function_many() about disabled interrupts.
>
> v3 -> v4: The new relocate_restore_code() didn't cover the case when
> the page containing the relocated code was mapped via a huge (1G)
> page and it didn't clear the NX bit in that case, which led to the
> page fault reported by Logan at
> http://marc.info/?l=linux-kernel&m=146725158621626&w=4.
> Fix that by making relocate_restore_code() handle the huge page
> mapping case too.
>
> I've retained the Tested-by: from Kees as these changes have nothing to do with
> whether or not KASLR works with hibernation after this patch.
>
> Logan, please test this one thoroughly. I'll give it at least one week in
> linux-next going forward.
>
> Boris, please test it on the machine where we saw memory corruption with
> the previous versions if poss.
>
> ---
> arch/x86/power/hibernate_64.c | 97 +++++++++++++++++++++++++++++++++-----
> arch/x86/power/hibernate_asm_64.S | 55 +++++++++------------
> 2 files changed, 109 insertions(+), 43 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -19,6 +19,7 @@
> #include <asm/mtrr.h>
> #include <asm/sections.h>
> #include <asm/suspend.h>
> +#include <asm/tlbflush.h>
>
> /* Defined in hibernate_asm_64.S */
> extern asmlinkage __visible int restore_image(void);
> @@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
> * kernel's text (this value is passed in the image header).
> */
> unsigned long restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
> /*
> * Value of the cr3 register from before the hibernation (this value is passed
> @@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
>
> pgd_t *temp_level4_pgt __visible;
>
> -void *relocated_restore_code __visible;
> +unsigned long relocated_restore_code __visible;
> +
> +static int set_up_temporary_text_mapping(void)
> +{
> + pmd_t *pmd;
> + pud_t *pud;
> +
> + /*
> + * The new mapping only has to cover the page containing the image
> + * kernel's entry point (jump_address_phys), because the switch over to
> + * it is carried out by relocated code running from a page allocated
> + * specifically for this purpose and covered by the identity mapping, so
> + * the temporary kernel text mapping is only needed for the final jump.
> + * Moreover, in that mapping the virtual address of the image kernel's
> + * entry point must be the same as its virtual address in the image
> + * kernel (restore_jump_address), so the image kernel's
> + * restore_registers() code doesn't find itself in a different area of
> + * the virtual address space after switching over to the original page
> + * tables used by the image kernel.
> + */
> + pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> + if (!pud)
> + return -ENOMEM;
> +
> + pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> + if (!pmd)
> + return -ENOMEM;
> +
> + set_pmd(pmd + pmd_index(restore_jump_address),
> + __pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
> + set_pud(pud + pud_index(restore_jump_address),
> + __pud(__pa(pmd) | _KERNPG_TABLE));
> + set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
> + __pgd(__pa(pud) | _KERNPG_TABLE));
> +
> + return 0;
> +}
>
> static void *alloc_pgt_page(void *context)
> {
> @@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
> if (!temp_level4_pgt)
> return -ENOMEM;
>
> - /* It is safe to reuse the original kernel mapping */
> - set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> - init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> + /* Prepare a temporary mapping for the kernel text */
> + result = set_up_temporary_text_mapping();
> + if (result)
> + return result;
>
> /* Set up the direct mapping from scratch */
> for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +117,50 @@ static int set_up_temporary_mappings(voi
> return 0;
> }
>
> +static int relocate_restore_code(void)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> +
> + relocated_restore_code = get_safe_page(GFP_ATOMIC);
> + if (!relocated_restore_code)
> + return -ENOMEM;
> +
> + memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> + /* Make the page containing the relocated code executable */
> + pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> + pud = pud_offset(pgd, relocated_restore_code);
> + if (pud_large(*pud)) {
> + set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
> + } else {
> + pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
> +
> + if (pmd_large(*pmd)) {
> + set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> + } else {
> + pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> + set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> + }
> + }
> + __flush_tlb_all();
> +
> + return 0;
> +}
> +
> int swsusp_arch_resume(void)
> {
> int error;
>
> /* We have got enough memory and from now on we cannot recover */
> - if ((error = set_up_temporary_mappings()))
> + error = set_up_temporary_mappings();
> + if (error)
> return error;
>
> - relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> - if (!relocated_restore_code)
> - return -ENOMEM;
> - memcpy(relocated_restore_code, &core_restore_code,
> - &restore_registers - &core_restore_code);
> + error = relocate_restore_code();
> + if (error)
> + return error;
>
> restore_image();
> return 0;
> @@ -109,11 +179,12 @@ int pfn_is_nosave(unsigned long pfn)
>
> struct restore_data_record {
> unsigned long jump_address;
> + unsigned long jump_address_phys;
> unsigned long cr3;
> unsigned long magic;
> };
>
> -#define RESTORE_MAGIC 0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC 0x123456789ABCDEF0UL
>
> /**
> * arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +197,8 @@ int arch_hibernation_header_save(void *a
>
> if (max_size < sizeof(struct restore_data_record))
> return -EOVERFLOW;
> - rdr->jump_address = restore_jump_address;
> + rdr->jump_address = (unsigned long)&restore_registers;
> + rdr->jump_address_phys = __pa_symbol(&restore_registers);
> rdr->cr3 = restore_cr3;
> rdr->magic = RESTORE_MAGIC;
> return 0;
> @@ -142,6 +214,7 @@ int arch_hibernation_header_restore(void
> struct restore_data_record *rdr = addr;
>
> restore_jump_address = rdr->jump_address;
> + jump_address_phys = rdr->jump_address_phys;
> restore_cr3 = rdr->cr3;
> return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
> }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
> pushfq
> popq pt_regs_flags(%rax)
>
> - /* save the address of restore_registers */
> - movq $restore_registers, %rax
> - movq %rax, restore_jump_address(%rip)
> /* save cr3 */
> movq %cr3, %rax
> movq %rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
> ENDPROC(swsusp_arch_suspend)
>
> ENTRY(restore_image)
> - /* switch to temporary page tables */
> - movq $__PAGE_OFFSET, %rdx
> - movq temp_level4_pgt(%rip), %rax
> - subq %rdx, %rax
> - movq %rax, %cr3
> - /* Flush TLB */
> - movq mmu_cr4_features(%rip), %rax
> - movq %rax, %rdx
> - andq $~(X86_CR4_PGE), %rdx
> - movq %rdx, %cr4; # turn off PGE
> - movq %cr3, %rcx; # flush TLB
> - movq %rcx, %cr3;
> - movq %rax, %cr4; # turn PGE back on
> -
> /* prepare to jump to the image kernel */
> - movq restore_jump_address(%rip), %rax
> - movq restore_cr3(%rip), %rbx
> + movq restore_jump_address(%rip), %r8
> + movq restore_cr3(%rip), %r9
> +
> + /* prepare to switch to temporary page tables */
> + movq temp_level4_pgt(%rip), %rax
> + movq mmu_cr4_features(%rip), %rbx
>
> /* prepare to copy image data to their original locations */
> movq restore_pblist(%rip), %rdx
> +
> + /* jump to relocated restore code */
> movq relocated_restore_code(%rip), %rcx
> jmpq *%rcx
>
> /* code below has been relocated to a safe page */
> ENTRY(core_restore_code)
> + /* switch to temporary page tables */
> + movq $__PAGE_OFFSET, %rcx
> + subq %rcx, %rax
> + movq %rax, %cr3
> + /* flush TLB */
> + movq %rbx, %rcx
> + andq $~(X86_CR4_PGE), %rcx
> + movq %rcx, %cr4; # turn off PGE
> + movq %cr3, %rcx; # flush TLB
> + movq %rcx, %cr3;
> + movq %rbx, %cr4; # turn PGE back on
> .Lloop:
> testq %rdx, %rdx
> jz .Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
> /* progress to the next pbe */
> movq pbe_next(%rdx), %rdx
> jmp .Lloop
> +
> .Ldone:
> /* jump to the restore_registers address from the image header */
> - jmpq *%rax
> - /*
> - * NOTE: This assumes that the boot kernel's text mapping covers the
> - * image kernel's page containing restore_registers and the address of
> - * this page is the same as in the image kernel's text mapping (it
> - * should always be true, because the text mapping is linear, starting
> - * from 0, and is supposed to cover the entire kernel text for every
> - * kernel).
> - *
> - * code below belongs to the image kernel
> - */
> + jmpq *%r8
>
> + /* code below belongs to the image kernel */
> + .align PAGE_SIZE
> ENTRY(restore_registers)
> FRAME_BEGIN
> /* go back to the original page tables */
> - movq %rbx, %cr3
> + movq %r9, %cr3
>
> /* Flush TLB, including "global" things (vmalloc) */
> movq mmu_cr4_features(%rip), %rax
>
^ permalink raw reply [flat|nested] 39+ messages in thread