* ktime_get_ts64() splat during resume
@ 2016-06-17 10:54 Borislav Petkov
  2016-06-17 11:53 ` Thomas Gleixner
From: Borislav Petkov @ 2016-06-17 10:54 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra; +Cc: lkml

Hi guys,

look what I've found this morning during resume:

[   45.732934] PM: Image restored successfully.
[   45.738064] PM: Basic memory bitmaps freed
[   45.742914] Restarting tasks ... 
[   45.746236] BUG: unable to handle kernel done.
[   45.752542] NULL pointer dereference at 0000000000000001
[   45.752544] IP: [<0000000000000001>] 0x1
[   45.752547] PGD 37922067 PUD 3791a067 PMD 0 
[   45.752548] Oops: 0010 [#1] PREEMPT SMP
[   45.752557] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd amd_iommu_v2 irqbypass crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq
[   45.752559] CPU: 2 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1
[   45.752560] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   45.752560] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[   45.752562] RIP: 0010:[<0000000000000001>]  [<0000000000000001>] 0x1
[   45.752562] RSP: 0018:ffff88042b957e50  EFLAGS: 00010246
[   45.752563] RAX: 0000000000000000 RBX: ffffffff81181e1e RCX: fffffffffffffdfe
[   45.752564] RDX: ffff88042b958000 RSI: ffff88042b954000 RDI: ffffffff8168aeac
[   45.752564] RBP: 0000000000000430 R08: 0000000000000000 R09: 0000000000000002
[   45.752565] R10: 0000000000000000 R11: 0000000000000001 R12: ffffff9c00000002
[   45.752565] R13: ffff88042b09a300 R14: ffffffff811782bf R15: 0000000000000011
[   45.752566] FS:  00007f9d470ab800(0000) GS:ffff88043dc80000(0000) knlGS:0000000000000000
[   45.752567] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   45.752567] CR2: 0000000000000001 CR3: 0000000037923000 CR4: 00000000000406e0
[   45.752568] Stack:
[   45.752569]  0000000000001180 0000000000000000 ffff88042b957e88 ffffffff810bb33a
[   45.752570]  00000000fffffdfe ffff88042b957f00 00007ffcfa91b1f0 ffff88042b957ee8
[   45.752571]  ffffffff81185fc7 0000000000000003 0000000021b8e1cd 0000000000000003
[   45.752571] Call Trace:
[   45.752576]  [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0
[   45.752578]  [<ffffffff81185fc7>] ? poll_select_copy_remaining+0xe7/0x130
[   45.752581]  [<ffffffff8100263a>] ? exit_to_usermode_loop+0x8a/0xb0
[   45.752582]  [<ffffffff81002a6b>] ? syscall_return_slowpath+0x5b/0x70
[   45.752584]  [<ffffffff8168b372>] ? entry_SYSCALL_64_fastpath+0xa5/0xa7
[   45.752587] Code:  Bad RIP value.
[   45.752588] RIP  [<0000000000000001>] 0x1
[   45.752589]  RSP <ffff88042b957e50>
[   45.752589] CR2: 0000000000000001
[   45.752597] ---[ end trace 5334fe9eec2bfca9 ]---
[   45.752737] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   45.752737] 
[   45.758773] Kernel Offset: disabled

rIP is gone.

That's rc3+tip/master from Wed:

commit 043ae67b6fdd9db7fb54564124d2f2fa833c5ee6 (refs/remotes/tip/master)
Merge: 0bbd6fbba8c7 00688272157d
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed Jun 15 12:53:02 2016 +0200

    Merge branch 'x86/platform'

The top function in the stacktrace is:

[   45.752576]  [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0

and that address is:

ffffffff810bb2f0 <ktime_get_ts64>:
ffffffff810bb2f0:       e8 6b 21 5d 00          callq  ffffffff8168d460 <__fentry__>
ffffffff810bb2f5:       55                      push   %rbp
ffffffff810bb2f6:       8b 05 14 52 bf 00       mov    0xbf5214(%rip),%eax        # ffffffff81cb0510 <timekeeping_suspended>
ffffffff810bb2fc:       48 89 e5                mov    %rsp,%rbp
ffffffff810bb2ff:       41 55                   push   %r13
ffffffff810bb301:       85 c0                   test   %eax,%eax
ffffffff810bb303:       41 54                   push   %r12
ffffffff810bb305:       53                      push   %rbx
ffffffff810bb306:       48 89 fb                mov    %rdi,%rbx
ffffffff810bb309:       0f 85 b6 00 00 00       jne    ffffffff810bb3c5 <ktime_get_ts64+0xd5>
ffffffff810bb30f:       45 31 e4                xor    %r12d,%r12d
ffffffff810bb312:       44 8b 2d a7 a9 f5 00    mov    0xf5a9a7(%rip),%r13d        # ffffffff82015cc0 <tk_core>
ffffffff810bb319:       41 f6 c5 01             test   $0x1,%r13b
ffffffff810bb31d:       0f 85 9b 00 00 00       jne    ffffffff810bb3be <ktime_get_ts64+0xce>
ffffffff810bb323:       48 8b 05 0e aa f5 00    mov    0xf5aa0e(%rip),%rax        # ffffffff82015d38 <tk_core+0x78>
ffffffff810bb32a:       48 89 03                mov    %rax,(%rbx)
ffffffff810bb32d:       48 8b 3d 94 a9 f5 00    mov    0xf5a994(%rip),%rdi        # ffffffff82015cc8 <tk_core+0x8>
ffffffff810bb334:       ff 15 96 a9 f5 00       callq  *0xf5a996(%rip)        # ffffffff82015cd0 <tk_core+0x10>
ffffffff810bb33a:       48 2b 05 9f a9 f5 00    sub    0xf5a99f(%rip),%rax        # ffffffff82015ce0 <tk_core+0x20>	<--- HERE
ffffffff810bb341:       48 8b 15 90 a9 f5 00    mov    0xf5a990(%rip),%rdx        # ffffffff82015cd8 <tk_core+0x18>
ffffffff810bb348:       44 8b 05 99 a9 f5 00    mov    0xf5a999(%rip),%r8d        # ffffffff82015ce8 <tk_core+0x28>
ffffffff810bb34f:       48 8b 35 9a a9 f5 00    mov    0xf5a99a(%rip),%rsi        # ffffffff82015cf0 <tk_core+0x30>
ffffffff810bb356:       8b 0d 90 a9 f5 00       mov    0xf5a990(%rip),%ecx        # ffffffff82015cec <tk_core+0x2c>
ffffffff810bb35c:       48 8b 3d e5 a9 f5 00    mov    0xf5a9e5(%rip),%rdi        # ffffffff82015d48 <tk_core+0x88>

i.e., right after the call to tkr->read():

static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
{
        cycle_t cycle_now, delta;

        /* read clocksource */
        cycle_now = tkr->read(tkr->clock);
		    ^^^^^^^^^^
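
For reference, tkr_mono is embedded in the statically allocated tk_core, and
read is a function pointer, so scribbling over it turns the call above into a
jump to a garbage address, which is exactly what the IP of 0x1 and "Code: Bad
RIP value" in the splat look like. A minimal user-space sketch of that failure
mode (hypothetical stand-in types, not the kernel's definitions):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t cycle_t;

struct clocksource { const char *name; };

struct tk_read_base {
	struct clocksource *clock;
	cycle_t (*read)(struct clocksource *cs);	/* the call site that oopsed */
};

static cycle_t dummy_read(struct clocksource *cs)
{
	(void)cs;
	return 42;
}

/* stand-in for tk_core.timekeeper.tkr_mono */
static struct tk_read_base tkr_mono = { .read = dummy_read };

int main(void)
{
	printf("cycle_now = %llu\n",
	       (unsigned long long)tkr_mono.read(tkr_mono.clock));

	/* If resume corrupts the pointer like this, the next indirect call
	 * jumps straight to address 0x1 and faults (kept commented out so
	 * the sketch actually runs): */
	/* tkr_mono.read = (cycle_t (*)(struct clocksource *))0x1; */
	/* tkr_mono.read(tkr_mono.clock); */
	return 0;
}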

Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or
it being uninitialized after suspend?

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 10:54 ktime_get_ts64() splat during resume Borislav Petkov
@ 2016-06-17 11:53 ` Thomas Gleixner
  2016-06-17 13:29   ` Borislav Petkov
From: Thomas Gleixner @ 2016-06-17 11:53 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Ingo Molnar, Peter Zijlstra, lkml, John Stultz

On Fri, 17 Jun 2016, Borislav Petkov wrote:
> look what I've found this morning during resume:
> 
> [   45.746236] BUG: unable to handle kernel done.
> [   45.752542] NULL pointer dereference at 0000000000000001
> [   45.752544] IP: [<0000000000000001>] 0x1

> static inline cycle_t timekeeping_get_delta(struct tk_read_base *tkr)
> {
>         cycle_t cycle_now, delta;
> 
>         /* read clocksource */
>         cycle_now = tkr->read(tkr->clock);
> 		    ^^^^^^^^^^
> 
> Ring any bells about something corrupting tk_core.timekeeper.tkr_mono or
> it being uninitialized after suspend?

It must be initialized; otherwise you wouldn't reach suspend. I have no idea
how that can happen.

Thanks,

	tglx


* Re: ktime_get_ts64() splat during resume
  2016-06-17 11:53 ` Thomas Gleixner
@ 2016-06-17 13:29   ` Borislav Petkov
  2016-06-17 14:33     ` Borislav Petkov
From: Borislav Petkov @ 2016-06-17 13:29 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Ingo Molnar, Peter Zijlstra, lkml, John Stultz

On Fri, Jun 17, 2016 at 01:53:53PM +0200, Thomas Gleixner wrote:
> It must be initialized; otherwise you wouldn't reach suspend. I have no idea
> how that can happen.

Btw, there's one other thing I'm seeing in the boot kernel while it suspends.
It's hardly related, though:

[   42.046585] kvm: exiting hardware virtualization
[   42.062446] AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0010 address=0x0000000020001000 fl]
[   42.255531] sd 3:0:0:0: [sdd] Synchronizing SCSI cache
[   42.263159] sd 3:0:0:0: [sdd] Stopping disk
[   43.090022] irq 16: nobody cared (try booting with the "irqpoll" option)
[   43.099122] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.7.0-rc3+ #4
[   43.107784] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/23
[   43.120143]  0000000000000000 ffff88043dd83e68 ffffffff812aacf3 ffff88003790e400
[   43.130145]  ffff88003790e49c ffff88043dd83e90 ffffffff810a5c3e ffff88003790e400
[   43.140118]  0000000000000000[   43.140201] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
[   43.140592] sd 2:0:0:0: [sdc] Stopping disk

[   43.157217]  0000000000000010 ffff88043dd83ec8 ffffffff810a5fe7
[   43.168302] Call Trace:
[   43.173210]  <IRQ>  [<ffffffff812aacf3>] dump_stack+0x67/0x94
[   43.181481]  [<ffffffff810a5c3e>] __report_bad_irq+0x3e/0xd0
[   43.189626]  [<ffffffff810a5fe7>] note_interrupt+0x257/0x2a0
[   43.197797]  [<ffffffff810a32b9>] handle_irq_event_percpu+0xa9/0x1f0
[   43.206655]  [<ffffffff810a3440>] handle_irq_event+0x40/0x70
[   43.214776]  [<ffffffff810a68d6>] handle_fasteoi_irq+0xb6/0x170
[   43.223166]  [<ffffffff8101911a>] handle_irq+0x1a/0x30
[   43.230748]  [<ffffffff8168c7a6>] do_IRQ+0x66/0x100
[   43.238045]  [<ffffffff8168ac46>] common_interrupt+0x86/0x86
[   43.246135]  <EOI>  [<ffffffff8133c7e8>] ? acpi_safe_halt+0x24/0x2c
[   43.254884]  [<ffffffff8133c7e6>] ? acpi_safe_halt+0x22/0x2c
[   43.262987]  [<ffffffff8133c810>] acpi_idle_do_entry+0x20/0x30
[   43.271298]  [<ffffffff8133cce1>] acpi_idle_enter+0x1dc/0x1fc
[   43.279462]  [<ffffffff8154ac2e>] cpuidle_enter_state+0x10e/0x300
[   43.287980]  [<ffffffff8154ae57>] cpuidle_enter+0x17/0x20
[   43.295799]  [<ffffffff8108f51a>] call_cpuidle+0x2a/0x40
[   43.303528]  [<ffffffff8108f94d>] cpu_startup_entry+0x2dd/0x3c0
[   43.304279] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[   43.306109] sd 1:0:0:0: [sdb] Stopping disk
[   43.325777]  [<ffffffff8103872a>] start_secondary+0x13a/0x150
[   43.333868] handlers:
[   43.338471] [<ffffffff8158d200>] azx_interrupt
[   43.345298] Disabling IRQ #16
[   43.848595] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   43.858068] sd 0:0:0:0: [sda] Stopping disk
[   44.052892] pcieport 0000:00:04.0: System wakeup enabled by ACPI
[   44.075715] ACPI: Preparing to enter system sleep state S5
[   44.083818] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query honored via cmdline
[   44.092258] reboot: Power down
[   44.096326] acpi_power_off called

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 13:29   ` Borislav Petkov
@ 2016-06-17 14:33     ` Borislav Petkov
  2016-06-17 15:28       ` Rafael J. Wysocki
From: Borislav Petkov @ 2016-06-17 14:33 UTC (permalink / raw)
  To: Thomas Gleixner, Rafael J. Wysocki
  Cc: Ingo Molnar, Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
	Rafael J. Wysocki, Kees Cook, stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Linux PM list, Stephen Smalley

Ok,

bisect is done, full log below.

Rafael, that fix

  70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")

breaks s2disk here. It explodes during resume and a statically allocated
struct's member is NULL. See

https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic

for the splat and some debugging attempts.

Reverting 70595b479ce1 fixes the issue here.

@stable folk: You might want to hold off on picking that one up for now...


Now, I have serial connected to that box so if you want me to dump stuff
and try patches, let me know.

Btw, while I have your ear, can you pick up my einj fixes please?

:-)

-> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email-bp@alien8.de

Thanks!

git bisect start
# bad: [f0e52976a6fad5aa2c8868d5fa91deee91299ebb] Merge remote-tracking branch 'tip/master' into tip-timers-splat
git bisect bad f0e52976a6fad5aa2c8868d5fa91deee91299ebb
# good: [bb967271c0317de8eed56b20aae978d31507b033] Merge tag 'pwm/for-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
git bisect good bb967271c0317de8eed56b20aae978d31507b033
# good: [bb41dcd3fd81971ce17f7c0e5020d343d905358f] Merge branch 'sched/core'
git bisect good bb41dcd3fd81971ce17f7c0e5020d343d905358f
# bad: [457642d3e6e1f7412c8174dbaeb06232618b2698] Merge branch 'x86/mm'
git bisect bad 457642d3e6e1f7412c8174dbaeb06232618b2698
# bad: [d818e3414b9def094a93ce57caad9cb75ecb2a17] Merge branch 'x86/boot'
git bisect bad d818e3414b9def094a93ce57caad9cb75ecb2a17
# good: [02e8fda2cc00419a11cf38199afea4c0d7172be8] x86/signals: Add build-time checks to the siginfo compat code
git bisect good 02e8fda2cc00419a11cf38199afea4c0d7172be8
# good: [c702ff2fc395d8d9949eef8f72fd8726943c5aa1] Merge branch 'timers/core'
git bisect good c702ff2fc395d8d9949eef8f72fd8726943c5aa1
# good: [0ef1f6adccfea59c600edca7632ff0834ddfee5b] Merge branch 'x86/asm'
git bisect good 0ef1f6adccfea59c600edca7632ff0834ddfee5b
# bad: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel
git bisect bad 70595b479ce173425dd5cb347dc6c8b1060dfb2c
# good: [8d950d2fb23b696d393020486ab6a350bcb2c582] MAINTAINERS: Update the Calgary IOMMU entry
git bisect good 8d950d2fb23b696d393020486ab6a350bcb2c582
# first bad commit: [70595b479ce173425dd5cb347dc6c8b1060dfb2c] x86/power/64: Fix crash whan the hibernation code passes control to the image kernel

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 14:33     ` Borislav Petkov
@ 2016-06-17 15:28       ` Rafael J. Wysocki
  2016-06-17 16:12         ` Borislav Petkov
  2016-06-20  8:17         ` ktime_get_ts64() splat during resume chenyu
From: Rafael J. Wysocki @ 2016-06-17 15:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
	Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley

On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
> Ok,
>
> bisect is done, full log below.
>
> Rafael, that fix
>
>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
>
> breaks s2disk here. It explodes during resume and a statically allocated
> struct's member is NULL. See
>
> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
>
> for the splat and some debugging attempts.
>
> Reverting 70595b479ce1 fixes the issue here.

Quite evidently, memory is corrupted in the image kernel, but this
particular commit only affects the boot kernel, so it can't really
corrupt anything in the image one.

A couple of questions:
- I guess this is reproducible 100% of the time?
- If you do "echo disk > /sys/power/state" instead of using s2disk,
does it still crash in the same way?
- Are both the image and boot kernels the same binary?

> @stable folk: You might want to hold off on picking that one up for now...
>
>
> Now, I have serial connected to that box so if you want me to dump stuff
> and try patches, let me know.
>
> Btw, while I have your ear, can you pick up my einj fixes please?
>
> :-)
>
> -> https://lkml.kernel.org/r/1463992084-10012-1-git-send-email-bp@alien8.de

Sure.  I thought Tony would pick them up.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 15:28       ` Rafael J. Wysocki
@ 2016-06-17 16:12         ` Borislav Petkov
  2016-06-17 21:03           ` Rafael J. Wysocki
  2016-06-20  8:17         ` ktime_get_ts64() splat during resume chenyu
From: Borislav Petkov @ 2016-06-17 16:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
	Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley

On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> A couple of questions:
> - I guess this is reproducible 100% of the time?

Yap.

I took latest Linus + tip/master which has your commit.

> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> does it still crash in the same way?

My suspend to disk script does:

echo 3 > /proc/sys/vm/drop_caches
echo "shutdown" > /sys/power/disk
echo "disk" > /sys/power/state

I haven't used anything else in years.

> - Are both the image and boot kernels the same binary?

Yep.

> Sure.  I thought Tony would pick them up.

Oh ok, next time I'll pester him.

:-)

But seriously, should we route the RAS-relevant stuff touching
drivers/acpi/apei/ through our tree instead? Provided you're fine with
them?

This way we'll take some of the burden off you...

Thanks!

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 16:12         ` Borislav Petkov
@ 2016-06-17 21:03           ` Rafael J. Wysocki
  2016-06-18  1:11             ` Rafael J. Wysocki
  2016-06-20 14:38             ` Rafael J. Wysocki
From: Rafael J. Wysocki @ 2016-06-17 21:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rafael J. Wysocki, Thomas Gleixner, Rafael J. Wysocki,
	Ingo Molnar, Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
	Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Linux PM list, Stephen Smalley

On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>> A couple of questions:
>> - I guess this is reproducible 100% of the time?
>
> Yap.
>
> I took latest Linus + tip/master which has your commit.
>
>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>> does it still crash in the same way?
>
> My suspend to disk script does:
>
> echo 3 > /proc/sys/vm/drop_caches
> echo "shutdown" > /sys/power/disk
> echo "disk" > /sys/power/state
>
> I haven't used anything else in years.
>
>> - Are both the image and boot kernels the same binary?
>
> Yep.

OK, we need to find out what's wrong, then.

First, please revert the changes in hibernate_asm_64.S alone and see
if that makes any difference.

Hibernation should still work then most of the time, but the bug fixed
by this commit will be back.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 21:03           ` Rafael J. Wysocki
@ 2016-06-18  1:11             ` Rafael J. Wysocki
  2016-06-20 14:38             ` Rafael J. Wysocki
From: Rafael J. Wysocki @ 2016-06-18  1:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Logan Gunthorpe, Rafael J. Wysocki, Kees Cook,
	Stable, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linus Torvalds, Linux PM list, Stephen Smalley

On Fri, Jun 17, 2016 at 11:03 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
>> On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
>>> A couple of questions:
>>> - I guess this is reproducible 100% of the time?
>>
>> Yap.
>>
>> I took latest Linus + tip/master which has your commit.
>>
>>> - If you do "echo disk > /sys/power/state" instead of using s2disk,
>>> does it still crash in the same way?
>>
>> My suspend to disk script does:
>>
>> echo 3 > /proc/sys/vm/drop_caches
>> echo "shutdown" > /sys/power/disk
>> echo "disk" > /sys/power/state
>>
>> I haven't used anything else in years.
>>
>>> - Are both the image and boot kernels the same binary?
>>
>> Yep.
>
> OK, we need to find out what's wrong, then.
>
> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

Due to the nature of the memory corruption you are seeing (the same
address appears to be corrupted every time in the same way) with 100%
reproducibility and due to the fact that new code added by the commit
in question only writes to dynamically allocated memory (and you've
already verified that https://patchwork.kernel.org/patch/9185165/
doesn't help), it is quite unlikely that the memory corruption comes
from that commit itself.

However, I see a couple of ways in which that commit might uncover a latent bug.

First, it changed the layout of the kernel text by adding the
PAGE_SIZE alignment of restore_registers().  That likely pushed stuff
behind it to new offsets, possibly including the static struct field
that is now corrupted.  Now, say that the memory corruption has always
happened at the same memory location, but previously there was nothing
in there, or whatever was in there wasn't vital.  In that case the
memory corruption might have gone unnoticed until the commit in
question caused things to move to new locations and the corrupted
location contains a vital piece of data now.  This is my current
theory.
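
The layout effect is easy to demonstrate in isolation; a toy user-space
sketch (purely illustrative, nothing kernel-specific about it):

#include <stdio.h>

/* Mimics the PAGE_SIZE alignment added to restore_registers(): */
__attribute__((aligned(4096))) static void aligned_func(void) { }
static void neighbour(void) { }

int main(void)
{
	/* neighbour() ends up at a different address than it would without
	 * the alignment above, just as symbols behind restore_registers()
	 * moved when the commit added the PAGE_SIZE alignment. */
	printf("aligned_func=%p neighbour=%p\n",
	       (void *)aligned_func, (void *)neighbour);
	return 0;
}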

Second, it added two invocations of get_safe_page() that in theory
might push things a bit too far towards the limit and they started to
break there.  I don't see how that can happen ATM, but I'm not
excluding this possibility yet.  It seems, though, that in that case
the corruption would be more random and I certainly wouldn't expect it
to happen at the same location every time.

One more indicator is that multiple people reported success with that
commit, over many hibernation runs, so the problem appears to be very
specific to your system and/or kernel configuration.  It is also
interesting that the memory corruption only becomes visible during the
thawing of tasks and given the piece of data that is corrupted, it
should become visible much earlier if the memory was corrupted during
image restoration by the boot kernel.

In any case, reverting the changes in hibernate_asm_64.S alone should
show us the direction, but if it makes things work again, I would try
to change the restore_registers() alignment to something smaller, like
64 (which should be safe) and see what happens then.


* Re: ktime_get_ts64() splat during resume
  2016-06-17 15:28       ` Rafael J. Wysocki
  2016-06-17 16:12         ` Borislav Petkov
@ 2016-06-20  8:17         ` chenyu
  2016-06-20 12:21           ` Rafael J. Wysocki
From: chenyu @ 2016-06-20  8:17 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, Thomas Gleixner, Rafael J. Wysocki, Ingo Molnar,
	Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
	Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Linux PM list, Stephen Smalley

On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
>> Ok,
>>
>> bisect is done, full log below.
>>
>> Rafael, that fix
>>
>>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
>>
>> breaks s2disk here. It explodes during resume and a statically allocated
>> struct's member is NULL. See
>>
>> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
>>
>> for the splat and some debugging attempts.
>>
>> Reverting 70595b479ce1 fixes the issue here.
>
> Quite evidently, memory is corrupted in the image kernel, but this
> particular commit only affects the boot kernel, so it can't really
> corrupt anything in the image one.
>
In the previous patch, before we jump to the new kernel's entry point, we add
the text mapping to temp_level4_pgt:

         /* switch over to the temporary kernel text mapping */
        movq    %r8, (%r9)

If I understand correctly, %r9 contains the virtual address of
restore_pgd_addr.  Since the page tables mapping restore_pgd_addr might be
incoherent across hibernation (the NX patch changes the kernel text mapping
dynamically), couldn't we write the pmd entry to an incorrect place in
temp_level4_pgt?

Yu


* Re: ktime_get_ts64() splat during resume
  2016-06-20  8:17         ` ktime_get_ts64() splat during resume chenyu
@ 2016-06-20 12:21           ` Rafael J. Wysocki
From: Rafael J. Wysocki @ 2016-06-20 12:21 UTC (permalink / raw)
  To: chenyu
  Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, John Stultz, Logan Gunthorpe,
	Rafael J. Wysocki, Kees Cook, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Linux PM list, Stephen Smalley

On Monday, June 20, 2016 04:17:13 PM chenyu wrote:
> On Fri, Jun 17, 2016 at 11:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Fri, Jun 17, 2016 at 4:33 PM, Borislav Petkov <bp@alien8.de> wrote:
> >> Ok,
> >>
> >> bisect is done, full log below.
> >>
> >> Rafael, that fix
> >>
> >>   70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
> >>
> >> breaks s2disk here. It explodes during resume and a statically allocated
> >> struct's member is NULL. See
> >>
> >> https://lkml.kernel.org/r/20160617105435.GB15997@pd.tnic
> >>
> >> for the splat and some debugging attempts.
> >>
> >> Reverting 70595b479ce1 fixes the issue here.
> >
> > Quite evidently, memory is corrupted in the image kernel, but this
> > particular commit only affects the boot kernel, so it can't really
> > corrupt anything in the image one.
> >
> In the previous patch, before we jump to the new kernel's entry point, we add
> the text mapping to temp_level4_pgt:
> 
>          /* switch over to the temporary kernel text mapping */
>         movq    %r8, (%r9)
> 
> If I understand correctly, %r9 contains the virtual address of
> restore_pgd_addr.  Since the page tables mapping restore_pgd_addr might be
> incoherent across hibernation (the NX patch changes the kernel text mapping
> dynamically), couldn't we write the pmd entry to an incorrect place in
> temp_level4_pgt?

I'm not sure what you mean.

r9 contains the address previously stored in restore_pgd_addr.  This is the
address of the cell in (the page pointed to by temp_level4_pgt) to write the
new pgd entry to.
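
In C terms, that movq just stores the prepared entry into the slot computed
earlier; roughly (a sketch with a scalar stand-in for pgd_t, for illustration
only):

typedef unsigned long pgd_t;	/* stand-in; the real pgd_t is a wrapper type */

static pgd_t restore_pgd;	/* prepared text-mapping entry, held in %r8 */
static pgd_t *restore_pgd_addr;	/* slot inside temp_level4_pgt, held in %r9 */

static void install_text_mapping(void)
{
	*restore_pgd_addr = restore_pgd;	/* movq %r8, (%r9) */
}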

Why do you think it may be incorrect?

In any case, the corruption reported by Boris also happens if all of the
assembly changes in commit 70595b479ce1 are reverted, in which case the code
should work as before that commit modulo some additional actions that
shouldn't matter.  Those additional actions turn out to matter, though.

I'll write about that in detail shortly.

Thanks,
Rafael


* Re: ktime_get_ts64() splat during resume
  2016-06-17 21:03           ` Rafael J. Wysocki
  2016-06-18  1:11             ` Rafael J. Wysocki
@ 2016-06-20 14:38             ` Rafael J. Wysocki
  2016-06-20 18:29               ` Linus Torvalds
From: Rafael J. Wysocki @ 2016-06-20 14:38 UTC (permalink / raw)
  To: Borislav Petkov, Logan Gunthorpe, Kees Cook
  Cc: Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linus Torvalds,
	Linux PM list, Stephen Smalley

On Friday, June 17, 2016 11:03:46 PM Rafael J. Wysocki wrote:
> On Fri, Jun 17, 2016 at 6:12 PM, Borislav Petkov <bp@alien8.de> wrote:
> > On Fri, Jun 17, 2016 at 05:28:10PM +0200, Rafael J. Wysocki wrote:
> >> A couple of questions:
> >> - I guess this is reproducible 100% of the time?
> >
> > Yap.
> >
> > I took latest Linus + tip/master which has your commit.
> >
> >> - If you do "echo disk > /sys/power/state" instead of using s2disk,
> >> does it still crash in the same way?
> >
> > My suspend to disk script does:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> > echo "shutdown" > /sys/power/disk
> > echo "disk" > /sys/power/state
> >
> > I haven't used anything else in years.
> >
> >> - Are both the image and boot kernels the same binary?
> >
> > Yep.
> 
> OK, we need to find out what's wrong, then.

Boris is traveling this week, so the investigation will continue when he's
back, but we spent quite some time on this during the weekend, so below is
a summary of what we found plus some comments.

Overall, we seem to be heading towards the "really weird" territory here.

> First, please revert the changes in hibernate_asm_64.S alone and see
> if that makes any difference.
>
> Hibernation should still work then most of the time, but the bug fixed
> by this commit will be back.

First of all, reverting the changes in hibernate_asm_64.S doesn't make the
breakage go away, which means that the most essential changes in
"x86/power/64: Fix crash whan the hibernation code passes control to the image
kernel" don't make a difference.

Moreover, after some back-and-forth we narrowed down the memory corruption to
a single line in the (new) prepare_temporary_text_mapping() function and it
turned out that it is triggered by scribbling on a page returned by
get_safe_page(), immediately after obtaining it.  Memory is corrupted no matter
how the page is written to.  For example, filling that page with all ones
triggers memory corruption later.

So appended is the last tested patch sufficient to trigger the breakage on
Boris' system (modulo some comments).  Quite evidently, the breakage is
triggered by the memset() line in prepare_temporary_text_mapping().

In addition to the above, there are some interesting observations from the
investigation so far.

First, what is corrupted is the image kernel memory.  This means that the line
in question scribbles on image data (which are stored in memory at that point)
somewhere.  Nevertheless, the image kernel is evidently able to continue after
it has received control and carry out the full device resume up to the point
in which user space is thawed and then it crashes as a result of page tables
corruption.  How those page tables are corrupted depends on what is written to
the page pointed to by pmd in prepare_temporary_text_mapping() (but, of course,
that write happens in the "boot" or "restore" kernel, before jumping to the
image one).

The above is what we know; now for the conclusions it seems to lead to.

Generally, I see two possible scenarios here.

One is that the changes not directly related to prepare_temporary_text_mapping()
in the patch below, e.g. the changes in arch_hibernation_header_save/restore()
(and related), make a difference, but that doesn't look likely to me.  In any
case, that will be the first thing to try next, as it is relatively easy.

The second scenario is fundamentally less attractive, because it means that the
address we end up with in pmd in prepare_temporary_text_mapping() is bogus.

My first reaction to that option would be "Well, what if get_safe_page() is
broken somehow?", but that's not so simple.  Namely, for the observed corruption
to happen, get_safe_page() would need to return a page that (1) had already
been allocated once before and (2) had already had image data written to it.
Still, the hibernation core only writes image data to pages that
have been explicitly allocated by it and it never frees them individually
(in particular, they are never freed before jumping to the image kernel at
all).  It sets two bits in two different bitmaps for every page allocated by it
and checks those two bits every time before writing image data to any page.
One may think "Well, maybe the bitmaps don't work correctly then and the bits
are set when they shouldn't be sometimes?", but the problem with that line of
reasoning is that get_safe_page() checks one of those very bits before returning
a page and the page is only returned if that bit is clear.  Thus, even if the
bits were set when they shouldn't be sometimes, get_safe_page() would still
not return such a page to the caller.

The next question to ask might be "What if, horror shock, get_zeroed_page()
called by get_safe_page() returns a page that has been allocated already?", but
then the bitmap check mentioned above would still prevent get_safe_page() from
returning that page (the bit in question would be set for it).

Hence, the only way get_safe_page() might return a wrong page in
prepare_temporary_text_mapping() would be if both get_zeroed_page() and the
bitmap failed at the same time to produce the bogus result.
Failures of one of the two individually are not observed, though, let alone a
combined failure leading to exactly the worst outcome.
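
In code, the allocation path being discussed looks roughly like this
(condensed from kernel/power/snapshot.c of that era, from memory, so treat
the details as approximate):

static void *get_image_page(gfp_t gfp_mask, int safe_needed)
{
	void *res = (void *)get_zeroed_page(gfp_mask);

	if (safe_needed)
		while (res && swsusp_page_is_free(virt_to_page(res))) {
			/* The page is unsafe, mark it for swsusp_free() */
			swsusp_set_page_forbidden(virt_to_page(res));
			allocated_unsafe_pages++;
			res = (void *)get_zeroed_page(gfp_mask);
		}
	if (res) {
		/* the two bits in the two bitmaps mentioned above */
		swsusp_set_page_forbidden(virt_to_page(res));
		swsusp_set_page_free(virt_to_page(res));
	}
	return res;
}

unsigned long get_safe_page(gfp_t gfp_mask)
{
	return (unsigned long)get_image_page(gfp_mask, PG_SAFE);
}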

And there are more questions, like for example, if get_safe_page() returns a
wrong page in that particular place in prepare_temporary_text_mapping(), why
doesn't it return wrong pages anywhere else?  There are at least several
invocations of it in set_up_temporary_mappings() and swsusp_arch_resume()
already and they don't lead to any kind of memory corruption, so why would
this particular one do that?

I'm honestly unsure about where that leads us, but it starts to look totally
weird to me.

Thanks,
Rafael


---
 arch/x86/power/hibernate_64.c |   45 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
  * Address to jump to in the last phase of restore in order to get to the image
  * kernel's text (this value is passed in the image header).
  */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,8 +38,37 @@ unsigned long restore_cr3 __visible;
 
 pgd_t *temp_level4_pgt __visible;
 
+void *restore_pgd_addr __visible;
+pgd_t restore_pgd __visible;
+
 void *relocated_restore_code __visible;
 
+static int prepare_temporary_text_mapping(void)
+{
+	unsigned long vaddr = (unsigned long)restore_jump_address;
+	unsigned long paddr = jump_address_phys & PMD_MASK;
+	pmd_t *pmd;
+	pud_t *pud;
+
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	restore_pgd = __pgd(__pa(pud) | _KERNPG_TABLE);
+
+	pud += pud_index(vaddr);
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
+
+	memset(pmd, 0xff, PAGE_SIZE); /* CORRUPTION HERE! */
+
+	restore_pgd_addr = temp_level4_pgt + pgd_index(vaddr);
+	return 0;
+}
+
 static void *alloc_pgt_page(void *context)
 {
 	return (void *)get_safe_page(GFP_ATOMIC);
@@ -63,6 +93,10 @@ static int set_up_temporary_mappings(voi
 	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
 		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
 
+	result = prepare_temporary_text_mapping();
+	if (result)
+		return result;
+
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
 		mstart = pfn_mapped[i].start << PAGE_SHIFT;
@@ -108,12 +142,13 @@ int pfn_is_nosave(unsigned long pfn)
 }
 
 struct restore_data_record {
-	unsigned long jump_address;
+	void *jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +161,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = &restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +178,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }


* Re: ktime_get_ts64() splat during resume
  2016-06-20 14:38             ` Rafael J. Wysocki
@ 2016-06-20 18:29               ` Linus Torvalds
  2016-06-20 21:15                 ` Rafael J. Wysocki
From: Linus Torvalds @ 2016-06-20 18:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, Logan Gunthorpe, Kees Cook, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml, John Stultz,
	Rafael J. Wysocki, Stable, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> Overall, we seem to be heading towards the "really weird" territory here.

So the whole commit that Boris bisected down to is weird to me.

Why isn't the temporary text mapping just set up unconditionally in
the temp_level4_pgt?

Why does it have that insane "let's leave the temp_level4_pgt alone
until we actually switch to it, and save away restore_pgd_addr and the
restore_pgd, to then be set up at restore time"?

All the other temporary mappings are set up statically in the
temp_level4_pgt, why not that one?

I suspect whatever corruption happens boils down to the same issue
that led people to that odd decision in the first place.

And regardless, those games are too ugly to live. So I would suggest
that that original commit should just be considered broken, and
reverted (or just removed - I'm not sure if it's in a stable branch or
not) and the fix be rethought so that the code mapping can be done
once and for all and *without* the extra games.

                 Linus


* Re: ktime_get_ts64() splat during resume
  2016-06-20 18:29               ` Linus Torvalds
@ 2016-06-20 21:15                 ` Rafael J. Wysocki
  2016-06-21  0:05                   ` Rafael J. Wysocki
From: Rafael J. Wysocki @ 2016-06-20 21:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rafael J. Wysocki, Borislav Petkov, Logan Gunthorpe, Kees Cook,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
	Stephen Smalley

On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>
>> Overall, we seem to be heading towards the "really weird" territory here.
>
> So the whole commit that Boris bisected down to is weird to me.
>
> Why isn't the temporary text mapping just set up unconditionally in
> the temp_level4_pgt?
>
> Why does it have that insane "let's leave the temp_level4_pgt alone
> until we actually switch to it, and save away restore_pgd_addr and the
> restore_pgd, to then be set up at restore time"?
>
> All the other temporary mappings are set up statically in the
> temp_level4_pgt, why not that one?

The text mapping in temp_level4_pgt has to map the image kernel's
physical entry address to the same virtual address that the image
kernel had for it, because the image kernel will switch over to its
own page tables first and it will use its own kernel text mapping from
that point on.  That may not be the same as the text mapping of the
(currently running) restore (or "boot") kernel.

So if we set up the text mapping in temp_level4_pgt upfront, we
basically can't reference the original kernel text (or do any
addressing relative to it) any more after switching over to
temp_level4_pgt.

For some reason I thought that was not doable, but now that I look at
the code it looks like it can be done.  I'll try doing that.

> I suspect whatever corruption happens boils down to the same issue
> that led people to that odd decision in the first place.

The breakage appears to happen regardless of these changes, though.

> And regardless, those games are too ugly to live. So I would suggest
> that that original commit should just be considered broken, and
> reverted (or just removed - I'm not sure if it's in a stable branch or
> not) and the fix be rethought so that the code mapping can be done
> once and for all and *without* the extra games.

OK

Thanks,
Rafael


* Re: ktime_get_ts64() splat during resume
  2016-06-20 21:15                 ` Rafael J. Wysocki
@ 2016-06-21  0:05                   ` Rafael J. Wysocki
  2016-06-21  1:22                     ` Rafael J. Wysocki
From: Rafael J. Wysocki @ 2016-06-21  0:05 UTC (permalink / raw)
  To: Linus Torvalds, Logan Gunthorpe, Kees Cook
  Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >>
> >> Overall, we seem to be heading towards the "really weird" territory here.
> >
> > So the whole commit that Boris bisected down to is weird to me.
> >
> > Why isn't the temporary text mapping just set up unconditionally in
> > the temp_level4_pgt?
> >
> > Why does it have that insane "let's leave the temp_level4_pgt alone
> > until we actually switch to it, and save away restore_pgd_addr and the
> > restore_pgd, to then be set up at restore time"?
> >
> > All the other temporary mappings are set up statically in the
> > temp_level4_pgt, why not that one?
> 
> The text mapping in temp_level4_pgt has to map the image kernel's
> physical entry address to the same virtual address that the image
> kernel had for it, because the image kernel will switch over to its
> own page tables first and it will use its own kernel text mapping from
> that point on.  That may not be the same as the text mapping of the
> (currently running) restore (or "boot") kernel.
> 
> So if we set up the text mapping in temp_level4_pgt upfront, we
> basically can't reference the original kernel text (or do any
> addressing relative to it) any more after switching over to
> temp_level4_pgt.
> 
> For some reason I thought that was not doable, but now that I look at
> the code it looks like it can be done.  I'll try doing that.
> 
> > I suspect whatever corruption happens boils down to the same issue
> > that led people to that odd decision in the first place.
> 
> The breakage appears to happen regardless of these changes, though.
> 
> > And regardless, those games are too ugly to live. So I would suggest
> > that that original commit should just be considered broken, and
> > reverted (or just removed - I'm not sure if it's in a stable branch or
> > not) and the fix be rethought so that the code mapping can be done
> > once and for all and *without* the extra games.
> 
> OK

Below is a new simplified version of the offending patch (without the
restore_pgd_addr and restore_pgd things).  It works for me. :-)

Logan, Kees, can you please check if this one works for you too?

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] x86/power/64: Fix kernel text mapping corruption during image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86, which is that the temporary
page tables set up by it to avoid page tables corruption when the
last bits of the image kernel's memory contents are copied into
their original page frames re-use the boot kernel's text mapping,
but that mapping may very well get corrupted just like any other
part of the page tables.  Of course, if that happens, the final
jump to the image kernel's entry point will go to nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames occupied by them
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run and then it will only be used just for the
final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/x86/power/hibernate_64.c     |   58 ++++++++++++++++++++++++++++++++------
 arch/x86/power/hibernate_asm_64.S |   55 +++++++++++++++---------------------
 2 files changed, 72 insertions(+), 41 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
  * Address to jump to in the last phase of restore in order to get to the image
  * kernel's text (this value is passed in the image header).
  */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
 
 void *relocated_restore_code __visible;
 
+static int set_up_temporary_text_mapping(void)
+{
+	unsigned long vaddr = (unsigned long)restore_jump_address;
+	unsigned long paddr = jump_address_phys & PMD_MASK;
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(vaddr),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
+
 static void *alloc_pgt_page(void *context)
 {
 	return (void *)get_safe_page(GFP_ATOMIC);
@@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -89,8 +127,7 @@ int swsusp_arch_resume(void)
 	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
 	if (!relocated_restore_code)
 		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
 
 	restore_image();
 	return 0;
@@ -108,12 +145,13 @@ int pfn_is_nosave(unsigned long pfn)
 }
 
 struct restore_data_record {
-	unsigned long jump_address;
+	void *jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +164,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = &restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +181,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,28 +54,29 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
+	/* prepare to jump to the image kernel */
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to copy image data to their original locations */
+	movq	restore_pblist(%rip), %rdx
+	movq	relocated_restore_code(%rip), %r10
+
 	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
 	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
+	movq	mmu_cr4_features(%rip), %rbx
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
 	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
 	movq	%cr3, %rcx;  # flush TLB
 	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
-	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
-
-	/* prepare to copy image data to their original locations */
-	movq	restore_pblist(%rip), %rdx
-	movq	relocated_restore_code(%rip), %rcx
-	jmpq	*%rcx
+	movq	%rbx, %cr4;  # turn PGE back on
+	/* jump to relocated restore code */
+	jmpq	*%r10
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
@@ -96,24 +94,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax


* Re: ktime_get_ts64() splat during resume
  2016-06-21  0:05                   ` Rafael J. Wysocki
@ 2016-06-21  1:22                     ` Rafael J. Wysocki
  2016-06-21  4:35                       ` Logan Gunthorpe
From: Rafael J. Wysocki @ 2016-06-21  1:22 UTC (permalink / raw)
  To: Linus Torvalds, Logan Gunthorpe, Kees Cook
  Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
> > On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >>
> > >> Overall, we seem to be heading towards the "really weird" territory here.
> > >
> > > So the whole commit that Boris bisected down to is weird to me.
> > >
> > > Why isn't the temporary text mapping just set up unconditionally in
> > > the temp_level4_pgt?
> > >
> > > Why does it have that insane "let's leave the temp_level4_pgt alone
> > > until we actually switch to it, and save away restore_pgd_addr and the
> > > restore_pgd, to then be set up at restore time"?
> > >
> > > All the other temporary mappings are set up statically in the
> > > temp_level4_pgt, why not that one?
> > 
> > The text mapping in temp_level4_pgt has to map the image kernel's
> > physical entry address to the same virtual address that the image
> > kernel had for it, because the image kernel will switch over to its
> > own page tables first and it will use its own kernel text mapping from
> > that point on.  That may not be the same as the text mapping of the
> > (currently running) restore (or "boot") kernel.
> > 
> > So if we set up the text mapping in temp_level4_pgt upfront, we
> > basically can't reference the original kernel text (or do any
> > addressing relative to it) any more after switching over to
> > temp_level4_pgt.
> > 
> > For some reason I thought that was not doable, but now that I look at
> > the code it looks like it can be done.  I'll try doing that.

I recalled what the problem was. :-)

In principle, the kernel text mapping in the image kernel may be different
from the kernel text mapping in the restore ("boot") kernel, but the patch
I posted a couple of hours ago actually assumed them to be the same, because
it switched to temp_level4_pgt before jumping to the relocated code.

To get rid of that implicit assumption, the switch to temp_level4_pgt has
to happen from the relocated code itself, but for that to work, the page
containing the relocated code must be executable in the original page
tables (it usually isn't).
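
In C terms, making that page executable boils down to clearing
_PAGE_NX on whatever mapping currently covers it, at the PMD level
for a large page and at the PTE level otherwise.  A condensed sketch
of what relocate_restore_code() in the patch below does (allocation,
copy and error handling omitted):

	unsigned long addr = (unsigned long)relocated_restore_code;
	pgd_t *pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
	pmd_t *pmd = pmd_offset(pud_offset(pgd, addr), addr);

	if (pmd_large(*pmd)) {
		/* 2M mapping: clear NX on the PMD itself */
		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
	} else {
		/* 4K mapping: clear NX on the individual PTE */
		pte_t *pte = pte_offset_kernel(pmd, addr);

		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
	}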

So an updated patch is appended.

Again, it works for me, but I'm wondering about everybody else.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86: the temporary page tables it
sets up to avoid page table corruption, while the last bits of the
image kernel's memory contents are copied into their original page
frames, re-use the boot kernel's text mapping, but that mapping may
very well get corrupted just like any other part of the page tables.
Of course, if that happens, the final jump to the image kernel's
entry point will go nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames they occupied
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run, and then it is only used for the final jump
to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/x86/power/hibernate_64.c     |   93 ++++++++++++++++++++++++++++++++------
 arch/x86/power/hibernate_asm_64.S |   55 +++++++++-------------
 2 files changed, 104 insertions(+), 44 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
  * Address to jump to in the last phase of restore in order to get to the image
  * kernel's text (this value is passed in the image header).
  */
-unsigned long restore_jump_address __visible;
+void *restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
 
 void *relocated_restore_code __visible;
 
+static int set_up_temporary_text_mapping(void)
+{
+	unsigned long vaddr = (unsigned long)restore_jump_address;
+	unsigned long paddr = jump_address_phys & PMD_MASK;
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(vaddr),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
+
 static void *alloc_pgt_page(void *context)
 {
 	return (void *)get_safe_page(GFP_ATOMIC);
@@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +116,45 @@ static int set_up_temporary_mappings(voi
 	return 0;
 }
 
+static int relocate_restore_code(void)
+{
+	unsigned long addr;
+	pgd_t *pgd;
+	pmd_t *pmd;
+
+	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
+	if (!relocated_restore_code)
+		return -ENOMEM;
+
+	memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+	/* Make the page containing the relocated code executable */
+	addr = (unsigned long)relocated_restore_code;
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
+	pmd = pmd_offset(pud_offset(pgd, addr), addr);
+	if (pmd_large(*pmd)) {
+		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+	} else {
+		pte_t *pte = pte_offset_kernel(pmd, addr);
+
+		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+	}
+
+	return 0;
+}
+
 int swsusp_arch_resume(void)
 {
 	int error;
 
 	/* We have got enough memory and from now on we cannot recover */
-	if ((error = set_up_temporary_mappings()))
+	error = set_up_temporary_mappings();
+	if (error)
 		return error;
 
-	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
-	if (!relocated_restore_code)
-		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	error = relocate_restore_code();
+	if (error)
+		return error;
 
 	restore_image();
 	return 0;
@@ -108,12 +172,13 @@ int pfn_is_nosave(unsigned long pfn)
 }
 
 struct restore_data_record {
-	unsigned long jump_address;
+	void *jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +191,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = &restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +208,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
-	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
-	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
-	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
-	movq	%cr3, %rcx;  # flush TLB
-	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
 	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to switch to temporary page tables */
+	movq	temp_level4_pgt(%rip), %rax
+	movq	mmu_cr4_features(%rip), %rbx
 
 	/* prepare to copy image data to their original locations */
 	movq	restore_pblist(%rip), %rdx
+
+	/* jump to relocated restore code */
 	movq	relocated_restore_code(%rip), %rcx
 	jmpq	*%rcx
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
+	/* switch to temporary page tables */
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
+	movq	%rax, %cr3
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
+	movq	%cr3, %rcx;  # flush TLB
+	movq	%rcx, %cr3;
+	movq	%rbx, %cr4;  # turn PGE back on
 .Lloop:
 	testq	%rdx, %rdx
 	jz	.Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax


* Re: ktime_get_ts64() splat during resume
  2016-06-21  1:22                     ` Rafael J. Wysocki
@ 2016-06-21  4:35                       ` Logan Gunthorpe
  2016-06-21 11:36                         ` Rafael J. Wysocki
  2016-06-21 18:04                         ` Kees Cook
  0 siblings, 2 replies; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-21  4:35 UTC (permalink / raw)
  To: Rafael J. Wysocki, Linus Torvalds, Kees Cook
  Cc: Rafael J. Wysocki, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, John Stultz, Rafael J. Wysocki, Stable,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

Hey Rafael,

This patch appears to be working on my laptop. Thanks.

Logan

On 20/06/16 07:22 PM, Rafael J. Wysocki wrote:
> On Tuesday, June 21, 2016 02:05:59 AM Rafael J. Wysocki wrote:
>> On Monday, June 20, 2016 11:15:18 PM Rafael J. Wysocki wrote:
>>> On Mon, Jun 20, 2016 at 8:29 PM, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> On Mon, Jun 20, 2016 at 7:38 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>>>>
>>>>> Overall, we seem to be heading towards the "really weird" territory here.
>>>>
>>>> So the whole commit that Boris bisected down to is weird to me.
>>>>
>>>> Why isn't the temporary text mapping just set up unconditionally in
>>>> the temp_level4_pgt?
>>>>
>>>> Why does it have that insane "let's leave the temp_level4_pgt alone
>>>> until we actually switch to it, and save away restore_pgd_addr and the
>>>> restore_pgd, to then be set up at restore time"?
>>>>
>>>> All the other temporary mappings are set up statically in the
>>>> temp_level4_pgt, why not that one?
>>>
>>> The text mapping in temp_level4_pgt has to map the image kernel's
>>> physical entry address to the same virtual address that the image
>>> kernel had for it, because the image kernel will switch over to its
>>> own page tables first and it will use its own kernel text mapping from
>>> that point on.  That may not be the same as the text mapping of the
>>> (currently running) restore (or "boot") kernel.
>>>
>>> So if we set up the text mapping in temp_level4_pgt upfront, we
>>> basically can't reference the original kernel text (or do any
>>> addressing relative to it) any more after switching over to
>>> temp_level4_pgt.
>>>
>>> For some reason I thought that was not doable, but now that I look at
>>> the code it looks like it can be done.  I'll try doing that.
>
> I recalled what the problem was. :-)
>
> In principle, the kernel text mapping in the image kernel may be different
> from the kernel text mapping in the restore ("boot") kernel, but the patch
> I posted a couple of hours ago actually assumed them to be the same, because
> it switched to temp_level4_pgt before jumping to the relocated code.
>
> To get rid of that implicit assumption, the switch to temp_level4_pgt has
> to happen from the relocated code itself, but for that to work, the page
> containing the relocated code must be executable in the original page
> tables (it usually isn't).
>
> So an updated patch is appended.
>
> Again, it works for me, but I'm wondering about everybody else.
>
> Thanks,
> Rafael
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up to avoid page table corruption, while the last bits of the
> image kernel's memory contents are copied into their original page
> frames, re-use the boot kernel's text mapping, but that mapping may
> very well get corrupted just like any other part of the page tables.
> Of course, if that happens, the final jump to the image kernel's
> entry point will go nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run, and then it is only used for the final jump
> to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/power/hibernate_64.c     |   93 ++++++++++++++++++++++++++++++++------
>  arch/x86/power/hibernate_asm_64.S |   55 +++++++++-------------
>  2 files changed, 104 insertions(+), 44 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -27,7 +27,8 @@ extern asmlinkage __visible int restore_
>   * Address to jump to in the last phase of restore in order to get to the image
>   * kernel's text (this value is passed in the image header).
>   */
> -unsigned long restore_jump_address __visible;
> +void *restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
>  /*
>   * Value of the cr3 register from before the hibernation (this value is passed
> @@ -39,6 +40,42 @@ pgd_t *temp_level4_pgt __visible;
>
>  void *relocated_restore_code __visible;
>
> +static int set_up_temporary_text_mapping(void)
> +{
> +	unsigned long vaddr = (unsigned long)restore_jump_address;
> +	unsigned long paddr = jump_address_phys & PMD_MASK;
> +	pmd_t *pmd;
> +	pud_t *pud;
> +
> +	/*
> +	 * The new mapping only has to cover the page containing the image
> +	 * kernel's entry point (jump_address_phys), because the switch over to
> +	 * it is carried out by relocated code running from a page allocated
> +	 * specifically for this purpose and covered by the identity mapping, so
> +	 * the temporary kernel text mapping is only needed for the final jump.
> +	 * Moreover, in that mapping the virtual address of the image kernel's
> +	 * entry point must be the same as its virtual address in the image
> +	 * kernel (restore_jump_address), so the image kernel's
> +	 * restore_registers() code doesn't find itself in a different area of
> +	 * the virtual address space after switching over to the original page
> +	 * tables used by the image kernel.
> +	 */
> +	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pud)
> +		return -ENOMEM;
> +
> +	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pmd)
> +		return -ENOMEM;
> +
> +	set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));
> +	set_pud(pud + pud_index(vaddr), __pud(__pa(pmd) | _KERNPG_TABLE));
> +	set_pgd(temp_level4_pgt + pgd_index(vaddr),
> +		__pgd(__pa(pud) | _KERNPG_TABLE));
> +
> +	return 0;
> +}
> +
>  static void *alloc_pgt_page(void *context)
>  {
>  	return (void *)get_safe_page(GFP_ATOMIC);
> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
>  	if (!temp_level4_pgt)
>  		return -ENOMEM;
>
> -	/* It is safe to reuse the original kernel mapping */
> -	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> -		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> +	/* Prepare a temporary mapping for the kernel text */
> +	result = set_up_temporary_text_mapping();
> +	if (result)
> +		return result;
>
>  	/* Set up the direct mapping from scratch */
>  	for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +116,45 @@ static int set_up_temporary_mappings(voi
>  	return 0;
>  }
>
> +static int relocate_restore_code(void)
> +{
> +	unsigned long addr;
> +	pgd_t *pgd;
> +	pmd_t *pmd;
> +
> +	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> +	if (!relocated_restore_code)
> +		return -ENOMEM;
> +
> +	memcpy(relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> +	/* Make the page containing the relocated code executable */
> +	addr = (unsigned long)relocated_restore_code;
> +	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(addr);
> +	pmd = pmd_offset(pud_offset(pgd, addr), addr);
> +	if (pmd_large(*pmd)) {
> +		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> +	} else {
> +		pte_t *pte = pte_offset_kernel(pmd, addr);
> +
> +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> +	}
> +
> +	return 0;
> +}
> +
>  int swsusp_arch_resume(void)
>  {
>  	int error;
>
>  	/* We have got enough memory and from now on we cannot recover */
> -	if ((error = set_up_temporary_mappings()))
> +	error = set_up_temporary_mappings();
> +	if (error)
>  		return error;
>
> -	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> -	if (!relocated_restore_code)
> -		return -ENOMEM;
> -	memcpy(relocated_restore_code, &core_restore_code,
> -	       &restore_registers - &core_restore_code);
> +	error = relocate_restore_code();
> +	if (error)
> +		return error;
>
>  	restore_image();
>  	return 0;
> @@ -108,12 +172,13 @@ int pfn_is_nosave(unsigned long pfn)
>  }
>
>  struct restore_data_record {
> -	unsigned long jump_address;
> +	void *jump_address;
> +	unsigned long jump_address_phys;
>  	unsigned long cr3;
>  	unsigned long magic;
>  };
>
> -#define RESTORE_MAGIC	0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC	0x123456789ABCDEF0UL
>
>  /**
>   *	arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +191,8 @@ int arch_hibernation_header_save(void *a
>
>  	if (max_size < sizeof(struct restore_data_record))
>  		return -EOVERFLOW;
> -	rdr->jump_address = restore_jump_address;
> +	rdr->jump_address = &restore_registers;
> +	rdr->jump_address_phys = __pa_symbol(&restore_registers);
>  	rdr->cr3 = restore_cr3;
>  	rdr->magic = RESTORE_MAGIC;
>  	return 0;
> @@ -142,6 +208,7 @@ int arch_hibernation_header_restore(void
>  	struct restore_data_record *rdr = addr;
>
>  	restore_jump_address = rdr->jump_address;
> +	jump_address_phys = rdr->jump_address_phys;
>  	restore_cr3 = rdr->cr3;
>  	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
>  }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
>  	pushfq
>  	popq	pt_regs_flags(%rax)
>
> -	/* save the address of restore_registers */
> -	movq	$restore_registers, %rax
> -	movq	%rax, restore_jump_address(%rip)
>  	/* save cr3 */
>  	movq	%cr3, %rax
>  	movq	%rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
>  ENDPROC(swsusp_arch_suspend)
>
>  ENTRY(restore_image)
> -	/* switch to temporary page tables */
> -	movq	$__PAGE_OFFSET, %rdx
> -	movq	temp_level4_pgt(%rip), %rax
> -	subq	%rdx, %rax
> -	movq	%rax, %cr3
> -	/* Flush TLB */
> -	movq	mmu_cr4_features(%rip), %rax
> -	movq	%rax, %rdx
> -	andq	$~(X86_CR4_PGE), %rdx
> -	movq	%rdx, %cr4;  # turn off PGE
> -	movq	%cr3, %rcx;  # flush TLB
> -	movq	%rcx, %cr3;
> -	movq	%rax, %cr4;  # turn PGE back on
> -
>  	/* prepare to jump to the image kernel */
> -	movq	restore_jump_address(%rip), %rax
> -	movq	restore_cr3(%rip), %rbx
> +	movq	restore_jump_address(%rip), %r8
> +	movq	restore_cr3(%rip), %r9
> +
> +	/* prepare to switch to temporary page tables */
> +	movq	temp_level4_pgt(%rip), %rax
> +	movq	mmu_cr4_features(%rip), %rbx
>
>  	/* prepare to copy image data to their original locations */
>  	movq	restore_pblist(%rip), %rdx
> +
> +	/* jump to relocated restore code */
>  	movq	relocated_restore_code(%rip), %rcx
>  	jmpq	*%rcx
>
>  	/* code below has been relocated to a safe page */
>  ENTRY(core_restore_code)
> +	/* switch to temporary page tables */
> +	movq	$__PAGE_OFFSET, %rcx
> +	subq	%rcx, %rax
> +	movq	%rax, %cr3
> +	/* flush TLB */
> +	movq	%rbx, %rcx
> +	andq	$~(X86_CR4_PGE), %rcx
> +	movq	%rcx, %cr4;  # turn off PGE
> +	movq	%cr3, %rcx;  # flush TLB
> +	movq	%rcx, %cr3;
> +	movq	%rbx, %cr4;  # turn PGE back on
>  .Lloop:
>  	testq	%rdx, %rdx
>  	jz	.Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
>  	/* progress to the next pbe */
>  	movq	pbe_next(%rdx), %rdx
>  	jmp	.Lloop
> +
>  .Ldone:
>  	/* jump to the restore_registers address from the image header */
> -	jmpq	*%rax
> -	/*
> -	 * NOTE: This assumes that the boot kernel's text mapping covers the
> -	 * image kernel's page containing restore_registers and the address of
> -	 * this page is the same as in the image kernel's text mapping (it
> -	 * should always be true, because the text mapping is linear, starting
> -	 * from 0, and is supposed to cover the entire kernel text for every
> -	 * kernel).
> -	 *
> -	 * code below belongs to the image kernel
> -	 */
> +	jmpq	*%r8
>
> +	 /* code below belongs to the image kernel */
> +	.align PAGE_SIZE
>  ENTRY(restore_registers)
>  	FRAME_BEGIN
>  	/* go back to the original page tables */
> -	movq    %rbx, %cr3
> +	movq    %r9, %cr3
>
>  	/* Flush TLB, including "global" things (vmalloc) */
>  	movq	mmu_cr4_features(%rip), %rax
>


* Re: ktime_get_ts64() splat during resume
  2016-06-21  4:35                       ` Logan Gunthorpe
@ 2016-06-21 11:36                         ` Rafael J. Wysocki
  2016-06-21 18:04                         ` Kees Cook
  1 sibling, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 11:36 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Linus Torvalds, Kees Cook, Rafael J. Wysocki, Borislav Petkov,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml, John Stultz,
	Rafael J. Wysocki, Stable, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Monday, June 20, 2016 10:35:16 PM Logan Gunthorpe wrote:
> Hey Rafael,

Hi,

> This patch appears to be working on my laptop. Thanks.

Thanks for the confirmation!

Rafael


* Re: ktime_get_ts64() splat during resume
  2016-06-21  4:35                       ` Logan Gunthorpe
  2016-06-21 11:36                         ` Rafael J. Wysocki
@ 2016-06-21 18:04                         ` Kees Cook
  2016-06-21 23:29                           ` Rafael J. Wysocki
  2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
  1 sibling, 2 replies; 39+ messages in thread
From: Kees Cook @ 2016-06-21 18:04 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Rafael J. Wysocki, Linus Torvalds, Rafael J. Wysocki,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
	Stephen Smalley

On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey Rafael,
>
> This patch appears to be working on my laptop. Thanks.

Same for me: resume still works with KASLR in my tests too.

-Kees

-- 
Kees Cook
Chrome OS & Brillo Security


* Re: ktime_get_ts64() splat during resume
  2016-06-21 18:04                         ` Kees Cook
@ 2016-06-21 23:29                           ` Rafael J. Wysocki
  2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
  1 sibling, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-21 23:29 UTC (permalink / raw)
  To: Kees Cook
  Cc: Logan Gunthorpe, Linus Torvalds, Rafael J. Wysocki,
	Borislav Petkov, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, John Stultz, Rafael J. Wysocki, Stable, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
	Stephen Smalley

On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Thanks for the confirmation!

Rafael


* [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
  2016-06-21 18:04                         ` Kees Cook
  2016-06-21 23:29                           ` Rafael J. Wysocki
@ 2016-06-27 14:24                           ` Rafael J. Wysocki
  2016-06-27 20:08                             ` Borislav Petkov
                                               ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-27 14:24 UTC (permalink / raw)
  To: Kees Cook, Logan Gunthorpe, Borislav Petkov
  Cc: Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, Rafael J. Wysocki, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
	Stephen Smalley

On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> > Hey Rafael,
> >
> > This patch appears to be working on my laptop. Thanks.
> 
> Same for me: resume still works with KASLR in my tests too.

Unfortunately, Boris still sees post-resume memory corruption with the patch
you have tested, but that turns out to result from some super-weird corruption
of a pointer on the stack, which leads to a store to a wrong address (and there
is no way it can be related to the rest of the patch).

We have verified that it can be avoided by rearranging set_up_temporary_text_mapping()
to use fewer local variables, and the appended patch contains that modification.

I also went on and changed relocate_restore_code() slightly in a similar fashion,
but none of those changes should affect the behavior (unless there's something
insane going on behind the curtains, that is).
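
For reference, the rearrangement amounts to dropping the intermediate
locals in set_up_temporary_text_mapping().  Where the previous version
computed

	unsigned long vaddr = (unsigned long)restore_jump_address;
	unsigned long paddr = jump_address_phys & PMD_MASK;
	...
	set_pmd(pmd + pmd_index(vaddr), __pmd(paddr | __PAGE_KERNEL_LARGE_EXEC));

the version below feeds the expressions directly into the page table
setters:

	set_pmd(pmd + pmd_index(restore_jump_address),
		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));

Similarly, relocated_restore_code is now an unsigned long, so the NX
fixup indexes the page tables with it directly instead of going
through a separate local variable.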

Kees, Logan, Boris, please try this one and let me know if it works for you.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86: the temporary page tables it
sets up to avoid page table corruption, while the last bits of the
image kernel's memory contents are copied into their original page
frames, re-use the boot kernel's text mapping, but that mapping may
very well get corrupted just like any other part of the page tables.
Of course, if that happens, the final jump to the image kernel's
entry point will go nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames they occupied
previously doesn't use the kernel text mapping, because it runs from
a special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run, and then it is only used for the final jump
to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/x86/power/hibernate_64.c     |   90 ++++++++++++++++++++++++++++++++------
 arch/x86/power/hibernate_asm_64.S |   55 ++++++++++-------------
 2 files changed, 102 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +38,43 @@ unsigned long restore_cr3 __visible;
 
 pgd_t *temp_level4_pgt __visible;
 
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(restore_jump_address),
+		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(restore_jump_address),
+		__pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
 
 static void *alloc_pgt_page(void *context)
 {
@@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +116,44 @@ static int set_up_temporary_mappings(voi
 	return 0;
 }
 
+static int relocate_restore_code(void)
+{
+	pgd_t *pgd;
+	pmd_t *pmd;
+
+	relocated_restore_code = get_safe_page(GFP_ATOMIC);
+	if (!relocated_restore_code)
+		return -ENOMEM;
+
+	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+	/* Make the page containing the relocated code executable */
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+	pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
+			 relocated_restore_code);
+	if (pmd_large(*pmd)) {
+		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+	} else {
+		pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+	}
+
+	return 0;
+}
+
 int swsusp_arch_resume(void)
 {
 	int error;
 
 	/* We have got enough memory and from now on we cannot recover */
-	if ((error = set_up_temporary_mappings()))
+	error = set_up_temporary_mappings();
+	if (error)
 		return error;
 
-	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
-	if (!relocated_restore_code)
-		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	error = relocate_restore_code();
+	if (error)
+		return error;
 
 	restore_image();
 	return 0;
@@ -109,11 +172,12 @@ int pfn_is_nosave(unsigned long pfn)
 
 struct restore_data_record {
 	unsigned long jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +190,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = (unsigned long)&restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +207,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
-	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
-	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
-	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
-	movq	%cr3, %rcx;  # flush TLB
-	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
 	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to switch to temporary page tables */
+	movq	temp_level4_pgt(%rip), %rax
+	movq	mmu_cr4_features(%rip), %rbx
 
 	/* prepare to copy image data to their original locations */
 	movq	restore_pblist(%rip), %rdx
+
+	/* jump to relocated restore code */
 	movq	relocated_restore_code(%rip), %rcx
 	jmpq	*%rcx
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
+	/* switch to temporary page tables */
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
+	movq	%rax, %cr3
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
+	movq	%cr3, %rcx;  # flush TLB
+	movq	%rcx, %cr3;
+	movq	%rbx, %cr4;  # turn PGE back on
 .Lloop:
 	testq	%rdx, %rdx
 	jz	.Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax


* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume)
  2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
@ 2016-06-27 20:08                             ` Borislav Petkov
  2016-06-27 23:33                             ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
  2016-06-30 13:17                             ` [PATCH v4] " Rafael J. Wysocki
  2 siblings, 0 replies; 39+ messages in thread
From: Borislav Petkov @ 2016-06-27 20:08 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kees Cook, Logan Gunthorpe, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Mon, Jun 27, 2016 at 04:24:22PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
> 
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
> 
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up to avoid page table corruption, while the last bits of the
> image kernel's memory contents are copied into their original page
> frames, re-use the boot kernel's text mapping, but that mapping may
> very well get corrupted just like any other part of the page tables.
> Of course, if that happens, the final jump to the image kernel's
> entry point will go nowhere.
> 
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
> 
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run, and then it is only used for the final jump
> to the image kernel's entry point.
> 
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
> 
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
> 
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
> 
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
> 
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/power/hibernate_64.c     |   90 ++++++++++++++++++++++++++++++++------
>  arch/x86/power/hibernate_asm_64.S |   55 ++++++++++-------------
>  2 files changed, 102 insertions(+), 43 deletions(-)

Reported-and-tested-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.


* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
  2016-06-27 20:08                             ` Borislav Petkov
@ 2016-06-27 23:33                             ` Logan Gunthorpe
  2016-06-29 14:48                               ` Kees Cook
  2016-06-30 13:17                             ` [PATCH v4] " Rafael J. Wysocki
  2 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-27 23:33 UTC (permalink / raw)
  To: Rafael J. Wysocki, Kees Cook, Borislav Petkov
  Cc: Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, lkml, Rafael J. Wysocki, Andy Lutomirski,
	Brian Gerst, Denys Vlasenko, H. Peter Anvin, Linux PM list,
	Stephen Smalley

Hey,

This version is working for me as well. Thanks.

Logan

On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>>> Hey Rafael,
>>>
>>> This patch appears to be working on my laptop. Thanks.
>>
>> Same for me: resume still works with KASLR in my tests too.
>
> Unfortunately, Boris still sees post-resume memory corruption with the patch
> you have tested, but that turns out to be a result of some super-weird
> corruption of a pointer on the stack which leads to a store at a wrong address
> (and there's no way it can be related to the rest of the patch).
>
> We have verified that it can be avoided by rearranging set_up_temporary_text_mapping()
> to use fewer local variables and the appended patch contains this modification.
>
> I also went on and changed relocate_restore_code() slightly in a similar fashion,
> but all of those changes should not affect the behavior (unless there's something
> insane going on behind the curtains, that is).
>
> Kees, Logan, Boris, please try this one and let me know if it works for you.
>
> Thanks,
> Rafael
>
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86: the temporary page tables it
> sets up to avoid page table corruption, while the last bits of the
> image kernel's memory contents are copied into their original page
> frames, re-use the boot kernel's text mapping, but that mapping may
> very well get corrupted just like any other part of the page tables.
> Of course, if that happens, the final jump to the image kernel's
> entry point will go nowhere.
>
> The exact reason why commit ab76f7b4ab23 matters here is that it
> sometimes causes a PMD of a large page to be split into PTEs
> that are allocated dynamically and get corrupted during image
> restoration as described above.
>
> To fix that issue, note that the code copying the last bits of the
> image kernel's memory contents to the page frames they occupied
> previously doesn't use the kernel text mapping, because it runs from
> a special page covered by the identity mapping set up for that code
> from scratch.  Hence, the kernel text mapping is only needed before
> that code starts to run, and then it is only used for the final jump
> to the image kernel's entry point.
>
> Accordingly, the temporary page tables set up in swsusp_arch_resume()
> on x86-64 need to contain the kernel text mapping too.  That mapping
> is only going to be used for the final jump to the image kernel, so
> it only needs to cover the image kernel's entry point, because the
> first thing the image kernel does after getting control back is to
> switch over to its own original page tables.  Moreover, the virtual
> address of the image kernel's entry point in that mapping has to be
> the same as the one mapped by the image kernel's page tables.
>
> With that in mind, modify the x86-64's arch_hibernation_header_save()
> and arch_hibernation_header_restore() routines to pass the physical
> address of the image kernel's entry point (in addition to its virtual
> address) to the boot kernel (a small piece of assembly code involved
> in passing the entry point's virtual address to the image kernel is
> not necessary any more after that, so drop it).  Update RESTORE_MAGIC
> too to reflect the image header format change.
>
> Next, in set_up_temporary_mappings(), use the physical and virtual
> addresses of the image kernel's entry point passed in the image
> header to set up a minimum kernel text mapping (using memory pages
> that won't be overwritten by the image kernel's memory contents) that
> will map those addresses to each other as appropriate.
>
> This makes the concern about the possible corruption of the original
> boot kernel text mapping go away, and if the minimum kernel text
> mapping used for the final jump marks the image kernel's entry point
> memory as executable, the jump to it is guaranteed to succeed.
>
> Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
> Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
> Reported-by: Logan Gunthorpe <logang@deltatee.com>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  arch/x86/power/hibernate_64.c     |   90 ++++++++++++++++++++++++++++++++------
>  arch/x86/power/hibernate_asm_64.S |   55 ++++++++++-------------
>  2 files changed, 102 insertions(+), 43 deletions(-)
>
> Index: linux-pm/arch/x86/power/hibernate_64.c
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_64.c
> +++ linux-pm/arch/x86/power/hibernate_64.c
> @@ -28,6 +28,7 @@ extern asmlinkage __visible int restore_
>   * kernel's text (this value is passed in the image header).
>   */
>  unsigned long restore_jump_address __visible;
> +unsigned long jump_address_phys;
>
>  /*
>   * Value of the cr3 register from before the hibernation (this value is passed
> @@ -37,7 +38,43 @@ unsigned long restore_cr3 __visible;
>
>  pgd_t *temp_level4_pgt __visible;
>
> -void *relocated_restore_code __visible;
> +unsigned long relocated_restore_code __visible;
> +
> +static int set_up_temporary_text_mapping(void)
> +{
> +	pmd_t *pmd;
> +	pud_t *pud;
> +
> +	/*
> +	 * The new mapping only has to cover the page containing the image
> +	 * kernel's entry point (jump_address_phys), because the switch over to
> +	 * it is carried out by relocated code running from a page allocated
> +	 * specifically for this purpose and covered by the identity mapping, so
> +	 * the temporary kernel text mapping is only needed for the final jump.
> +	 * Moreover, in that mapping the virtual address of the image kernel's
> +	 * entry point must be the same as its virtual address in the image
> +	 * kernel (restore_jump_address), so the image kernel's
> +	 * restore_registers() code doesn't find itself in a different area of
> +	 * the virtual address space after switching over to the original page
> +	 * tables used by the image kernel.
> +	 */
> +	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pud)
> +		return -ENOMEM;
> +
> +	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
> +	if (!pmd)
> +		return -ENOMEM;
> +
> +	set_pmd(pmd + pmd_index(restore_jump_address),
> +		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
> +	set_pud(pud + pud_index(restore_jump_address),
> +		__pud(__pa(pmd) | _KERNPG_TABLE));
> +	set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
> +		__pgd(__pa(pud) | _KERNPG_TABLE));
> +
> +	return 0;
> +}
>
>  static void *alloc_pgt_page(void *context)
>  {
> @@ -59,9 +96,10 @@ static int set_up_temporary_mappings(voi
>  	if (!temp_level4_pgt)
>  		return -ENOMEM;
>
> -	/* It is safe to reuse the original kernel mapping */
> -	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
> -		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
> +	/* Prepare a temporary mapping for the kernel text */
> +	result = set_up_temporary_text_mapping();
> +	if (result)
> +		return result;
>
>  	/* Set up the direct mapping from scratch */
>  	for (i = 0; i < nr_pfn_mapped; i++) {
> @@ -78,19 +116,44 @@ static int set_up_temporary_mappings(voi
>  	return 0;
>  }
>
> +static int relocate_restore_code(void)
> +{
> +	pgd_t *pgd;
> +	pmd_t *pmd;
> +
> +	relocated_restore_code = get_safe_page(GFP_ATOMIC);
> +	if (!relocated_restore_code)
> +		return -ENOMEM;
> +
> +	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> +	/* Make the page containing the relocated code executable */
> +	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> +	pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
> +			 relocated_restore_code);
> +	if (pmd_large(*pmd)) {
> +		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> +	} else {
> +		pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> +		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> +	}
> +
> +	return 0;
> +}
> +
>  int swsusp_arch_resume(void)
>  {
>  	int error;
>
>  	/* We have got enough memory and from now on we cannot recover */
> -	if ((error = set_up_temporary_mappings()))
> +	error = set_up_temporary_mappings();
> +	if (error)
>  		return error;
>
> -	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
> -	if (!relocated_restore_code)
> -		return -ENOMEM;
> -	memcpy(relocated_restore_code, &core_restore_code,
> -	       &restore_registers - &core_restore_code);
> +	error = relocate_restore_code();
> +	if (error)
> +		return error;
>
>  	restore_image();
>  	return 0;
> @@ -109,11 +172,12 @@ int pfn_is_nosave(unsigned long pfn)
>
>  struct restore_data_record {
>  	unsigned long jump_address;
> +	unsigned long jump_address_phys;
>  	unsigned long cr3;
>  	unsigned long magic;
>  };
>
> -#define RESTORE_MAGIC	0x0123456789ABCDEFUL
> +#define RESTORE_MAGIC	0x123456789ABCDEF0UL
>
>  /**
>   *	arch_hibernation_header_save - populate the architecture specific part
> @@ -126,7 +190,8 @@ int arch_hibernation_header_save(void *a
>
>  	if (max_size < sizeof(struct restore_data_record))
>  		return -EOVERFLOW;
> -	rdr->jump_address = restore_jump_address;
> +	rdr->jump_address = (unsigned long)&restore_registers;
> +	rdr->jump_address_phys = __pa_symbol(&restore_registers);
>  	rdr->cr3 = restore_cr3;
>  	rdr->magic = RESTORE_MAGIC;
>  	return 0;
> @@ -142,6 +207,7 @@ int arch_hibernation_header_restore(void
>  	struct restore_data_record *rdr = addr;
>
>  	restore_jump_address = rdr->jump_address;
> +	jump_address_phys = rdr->jump_address_phys;
>  	restore_cr3 = rdr->cr3;
>  	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
>  }
> Index: linux-pm/arch/x86/power/hibernate_asm_64.S
> ===================================================================
> --- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
> +++ linux-pm/arch/x86/power/hibernate_asm_64.S
> @@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
>  	pushfq
>  	popq	pt_regs_flags(%rax)
>
> -	/* save the address of restore_registers */
> -	movq	$restore_registers, %rax
> -	movq	%rax, restore_jump_address(%rip)
>  	/* save cr3 */
>  	movq	%cr3, %rax
>  	movq	%rax, restore_cr3(%rip)
> @@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
>  ENDPROC(swsusp_arch_suspend)
>
>  ENTRY(restore_image)
> -	/* switch to temporary page tables */
> -	movq	$__PAGE_OFFSET, %rdx
> -	movq	temp_level4_pgt(%rip), %rax
> -	subq	%rdx, %rax
> -	movq	%rax, %cr3
> -	/* Flush TLB */
> -	movq	mmu_cr4_features(%rip), %rax
> -	movq	%rax, %rdx
> -	andq	$~(X86_CR4_PGE), %rdx
> -	movq	%rdx, %cr4;  # turn off PGE
> -	movq	%cr3, %rcx;  # flush TLB
> -	movq	%rcx, %cr3;
> -	movq	%rax, %cr4;  # turn PGE back on
> -
>  	/* prepare to jump to the image kernel */
> -	movq	restore_jump_address(%rip), %rax
> -	movq	restore_cr3(%rip), %rbx
> +	movq	restore_jump_address(%rip), %r8
> +	movq	restore_cr3(%rip), %r9
> +
> +	/* prepare to switch to temporary page tables */
> +	movq	temp_level4_pgt(%rip), %rax
> +	movq	mmu_cr4_features(%rip), %rbx
>
>  	/* prepare to copy image data to their original locations */
>  	movq	restore_pblist(%rip), %rdx
> +
> +	/* jump to relocated restore code */
>  	movq	relocated_restore_code(%rip), %rcx
>  	jmpq	*%rcx
>
>  	/* code below has been relocated to a safe page */
>  ENTRY(core_restore_code)
> +	/* switch to temporary page tables */
> +	movq	$__PAGE_OFFSET, %rcx
> +	subq	%rcx, %rax
> +	movq	%rax, %cr3
> +	/* flush TLB */
> +	movq	%rbx, %rcx
> +	andq	$~(X86_CR4_PGE), %rcx
> +	movq	%rcx, %cr4;  # turn off PGE
> +	movq	%cr3, %rcx;  # flush TLB
> +	movq	%rcx, %cr3;
> +	movq	%rbx, %cr4;  # turn PGE back on
>  .Lloop:
>  	testq	%rdx, %rdx
>  	jz	.Ldone
> @@ -96,24 +96,17 @@ ENTRY(core_restore_code)
>  	/* progress to the next pbe */
>  	movq	pbe_next(%rdx), %rdx
>  	jmp	.Lloop
> +
>  .Ldone:
>  	/* jump to the restore_registers address from the image header */
> -	jmpq	*%rax
> -	/*
> -	 * NOTE: This assumes that the boot kernel's text mapping covers the
> -	 * image kernel's page containing restore_registers and the address of
> -	 * this page is the same as in the image kernel's text mapping (it
> -	 * should always be true, because the text mapping is linear, starting
> -	 * from 0, and is supposed to cover the entire kernel text for every
> -	 * kernel).
> -	 *
> -	 * code below belongs to the image kernel
> -	 */
> +	jmpq	*%r8
>
> +	 /* code below belongs to the image kernel */
> +	.align PAGE_SIZE
>  ENTRY(restore_registers)
>  	FRAME_BEGIN
>  	/* go back to the original page tables */
> -	movq    %rbx, %cr3
> +	movq    %r9, %cr3
>
>  	/* Flush TLB, including "global" things (vmalloc) */
>  	movq	mmu_cr4_features(%rip), %rax
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-27 23:33                             ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
@ 2016-06-29 14:48                               ` Kees Cook
  2016-06-30  1:52                                 ` Logan Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Kees Cook @ 2016-06-29 14:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Rafael J. Wysocki, Borislav Petkov, Linus Torvalds,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
> Hey,
>
> This version is working for me as well. Thanks.
>
> Logan
>
> On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
>>
>> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>>>
>>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com>
>>> wrote:
>>>>
>>>> Hey Rafael,
>>>>
>>>> This patch appears to be working on my laptop. Thanks.
>>>
>>>
>>> Same for me: resume still works with KASLR in my tests too.
>>
>>
>> Unfortunately, Boris still sees post-resume memory corruption with the
>> patch you have tested, but that turns out to be a result of some
>> super-weird corruption of a pointer on the stack which leads to a store
>> at a wrong address (and there's no way it can be related to the rest of
>> the patch).
>>
>> We have verified that it can be avoided by rearranging
>> set_up_temporary_text_mapping() to use fewer local variables and the
>> appended patch contains this modification.
>>
>> I also went on and changed relocate_restore_code() slightly in a similar
>> fashion, but all of those changes should not affect the behavior (unless
>> there's something insane going on behind the curtains, that is).
>>
>> Kees, Logan, Boris, please try this one and let me know if it works for
>> you.

Tested-by: Kees Cook <keescook@chromium.org>

I've done a few KASLR boots, and everything continues to look good to
me. Thanks!

-Kees

>>
>> Thanks,
>> Rafael
>>
>>
>> ---
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption
>> during image restoration
>>
>> [snip: full v2 patch, quoted in full earlier in the thread]
>



-- 
Kees Cook
Chrome OS & Brillo Security

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-29 14:48                               ` Kees Cook
@ 2016-06-30  1:52                                 ` Logan Gunthorpe
  2016-06-30  2:20                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30  1:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Rafael J. Wysocki, Borislav Petkov, Linus Torvalds,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

Hey Raf,

Sorry to report that although the patch works the majority of the time, 
I just got a suspicious kernel panic during resume.

It said:

"kernel tried to execute NX protected page - exploit attempt? (uid: 0)"

You can find a photo of the panic here:

http://staff.deltatee.com/~logang/panic.jpg

Logan


On 29/06/16 08:48 AM, Kees Cook wrote:
> On Mon, Jun 27, 2016 at 4:33 PM, Logan Gunthorpe <logang@deltatee.com> wrote:
>> Hey,
>>
>> This version is working for me as well. Thanks.
>>
>> Logan
>>
>> On 27/06/16 08:24 AM, Rafael J. Wysocki wrote:
>>>
>>> On Tuesday, June 21, 2016 11:04:41 AM Kees Cook wrote:
>>>>
>>>> On Mon, Jun 20, 2016 at 9:35 PM, Logan Gunthorpe <logang@deltatee.com>
>>>> wrote:
>>>>>
>>>>> Hey Rafael,
>>>>>
>>>>> This patch appears to be working on my laptop. Thanks.
>>>>
>>>>
>>>> Same for me: resume still works with KASLR in my tests too.
>>>
>>>
>>> Unfortunately, Boris still sees post-resume memory corruption with the
>>> patch you have tested, but that turns out to be a result of some
>>> super-weird corruption of a pointer on the stack which leads to a store
>>> at a wrong address (and there's no way it can be related to the rest of
>>> the patch).
>>>
>>> We have verified that it can be avoided by rearranging
>>> set_up_temporary_text_mapping() to use fewer local variables and the
>>> appended patch contains this modification.
>>>
>>> I also went on and changed relocate_restore_code() slightly in a similar
>>> fashion, but all of those changes should not affect the behavior (unless
>>> there's something insane going on behind the curtains, that is).
>>>
>>> Kees, Logan, Boris, please try this one and let me know if it works for
>>> you.
>
> Tested-by: Kees Cook <keescook@chromium.org>
>
> I've done a few KASLR boots, and everything continues to look good to
> me. Thanks!
>
> -Kees
>
>>>
>>> Thanks,
>>> Rafael
>>>
>>>
>>> ---
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> Subject: [PATCH v2] x86/power/64: Fix kernel text mapping corruption
>>> during image restoration
>>>
>>> [snip: full v2 patch, quoted in full earlier in the thread]
>>
>
>
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  1:52                                 ` Logan Gunthorpe
@ 2016-06-30  2:20                                   ` Rafael J. Wysocki
  2016-06-30  2:55                                     ` Rafael J. Wysocki
  2016-06-30  9:45                                     ` Borislav Petkov
  0 siblings, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30  2:20 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Wednesday, June 29, 2016 07:52:18 PM Logan Gunthorpe wrote:
> Hey Raf,
> 
> Sorry to report that although the patch works the majority of the time, 
> I just got a suspicious kernel panic during resume.
> 
> It said:
> 
> "kernel tried to execute NX protected page - exploit attempt? (uid: 0)"
> 
> You can find a photo of the panic here:
> 
> http://staff.deltatee.com/~logang/panic.jpg

Thanks for the report!

That's not what Boris was seeing at least.

It looks like clearing the NX bit for relocated_restore_code in
relocate_restore_code() didn't work for some reason.

I don't see why it wouldn't work ATM; I need to have a fresh look at that
tomorrow.
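
If it shows up again, one thing that might help narrow it down is dumping
the relocated page's mapping right before the restore_image() call, along
these lines (untested sketch, using the existing lookup_address() helper):

	unsigned int level;
	pte_t *pte = lookup_address(relocated_restore_code, &level);

	/* complain if the mapping is missing or the page is still NX */
	if (!pte || (pte_val(*pte) & _PAGE_NX))
		pr_warn("relocated restore code still NX? (level %u)\n",
			pte ? level : 0);

That would at least tell us whether the NX clearing took effect in the
page tables themselves or whether the problem is elsewhere.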

I had hoped to be able to fix this bug for 4.7, but it looks like it will
miss the mark after all.  Oh well.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  2:20                                   ` Rafael J. Wysocki
@ 2016-06-30  2:55                                     ` Rafael J. Wysocki
  2016-06-30  3:56                                       ` Logan Gunthorpe
  2016-06-30  9:45                                     ` Borislav Petkov
  1 sibling, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30  2:55 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Thursday, June 30, 2016 04:20:43 AM Rafael J. Wysocki wrote:
> On Wednesday, June 29, 2016 07:52:18 PM Logan Gunthorpe wrote:
> > Hey Raf,
> > 
> > Sorry to report that although the patch works the majority of the time, 
> > I just got a suspicious kernel panic during resume.
> > 
> > It said:
> > 
> > "kernel tried to execute NX protected page - exploit attempt? (uid: 0)"
> > 
> > You can find a photo of the panic here:
> > 
> > http://staff.deltatee.com/~logang/panic.jpg
> 
> Thanks for the report!
> 
> That's not what Boris was seeing at least.
> 
> It looks like clearing the NX bit for relocated_restore_code in
> relocate_restore_code() didn't work for some reason.
> 
> I don't see why it wouldn't work ATM; I need to have a fresh look at that
> tomorrow.
> 
> I had hoped to be able to fix this bug for 4.7, but it looks like it will
> miss the mark after all.  Oh well.

The only thing that comes to mind at this point is that TLBs should be flushed
after page table changes, so please apply the appended patch and let me know
if you see this panic any more with it.
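
In other words, the pattern has to be "clear NX first, then flush", roughly
(a sketch using the same helpers as in the patch below):

	/* make the PTE change visible before executing from that page */
	set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
	flush_tlb_all();	/* drop stale TLB entries that still have NX set */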

Thanks,
Rafael


---
 arch/x86/power/hibernate_64.c     |   92 +++++++++++++++++++++++++++++++++-----
 arch/x86/power/hibernate_asm_64.S |   55 +++++++++-------------
 2 files changed, 104 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
 #include <asm/mtrr.h>
 #include <asm/sections.h>
 #include <asm/suspend.h>
+#include <asm/tlbflush.h>
 
 /* Defined in hibernate_asm_64.S */
 extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
 
 pgd_t *temp_level4_pgt __visible;
 
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(restore_jump_address),
+		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(restore_jump_address),
+		__pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
 
 static void *alloc_pgt_page(void *context)
 {
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,45 @@ static int set_up_temporary_mappings(voi
 	return 0;
 }
 
+static int relocate_restore_code(void)
+{
+	pgd_t *pgd;
+	pmd_t *pmd;
+
+	relocated_restore_code = get_safe_page(GFP_ATOMIC);
+	if (!relocated_restore_code)
+		return -ENOMEM;
+
+	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+	/* Make the page containing the relocated code executable */
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+	pmd = pmd_offset(pud_offset(pgd, relocated_restore_code),
+			 relocated_restore_code);
+	if (pmd_large(*pmd)) {
+		set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+	} else {
+		pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+		set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+	}
+	flush_tlb_all();
+
+	return 0;
+}
+
 int swsusp_arch_resume(void)
 {
 	int error;
 
 	/* We have got enough memory and from now on we cannot recover */
-	if ((error = set_up_temporary_mappings()))
+	error = set_up_temporary_mappings();
+	if (error)
 		return error;
 
-	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
-	if (!relocated_restore_code)
-		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	error = relocate_restore_code();
+	if (error)
+		return error;
 
 	restore_image();
 	return 0;
@@ -109,11 +174,12 @@ int pfn_is_nosave(unsigned long pfn)
 
 struct restore_data_record {
 	unsigned long jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +192,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = (unsigned long)&restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +209,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
-	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
-	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
-	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
-	movq	%cr3, %rcx;  # flush TLB
-	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
 	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to switch to temporary page tables */
+	movq	temp_level4_pgt(%rip), %rax
+	movq	mmu_cr4_features(%rip), %rbx
 
 	/* prepare to copy image data to their original locations */
 	movq	restore_pblist(%rip), %rdx
+
+	/* jump to relocated restore code */
 	movq	relocated_restore_code(%rip), %rcx
 	jmpq	*%rcx
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
+	/* switch to temporary page tables */
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
+	movq	%rax, %cr3
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
+	movq	%cr3, %rcx;  # flush TLB
+	movq	%rcx, %cr3;
+	movq	%rbx, %cr4;  # turn PGE back on
 .Lloop:
 	testq	%rdx, %rdx
 	jz	.Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  2:55                                     ` Rafael J. Wysocki
@ 2016-06-30  3:56                                       ` Logan Gunthorpe
  2016-06-30 12:16                                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30  3:56 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kees Cook, Borislav Petkov, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley



On 29/06/16 08:55 PM, Rafael J. Wysocki wrote:

> The only thing that comes to mind at this point is that TLBs should be flushed
> after page table changes, so please apply the appended patch and let me know
> if you see this panic any more with it.


Ok, I'll build a new kernel tomorrow. But keep in mind the panic is 
pretty rare as I've only seen it once so far after a couple dozen or so 
hibernates. So it may be hard to get a concrete yes or no on whether it 
fixes the issue.

I've got a script to run a bunch of hibernates in a row. I usually only 
run it for a handful of iterations, but I'll try running it for much 
longer with this patch and let you know in a couple days.

Logan


> Thanks,
> Rafael
>
>
> ---
> [snip: full patch as posted above]

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  2:20                                   ` Rafael J. Wysocki
  2016-06-30  2:55                                     ` Rafael J. Wysocki
@ 2016-06-30  9:45                                     ` Borislav Petkov
  2016-06-30 11:27                                       ` Rafael J. Wysocki
  1 sibling, 1 reply; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30  9:45 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 04:20:43AM +0200, Rafael J. Wysocki wrote:
> That's not what Boris was seeing at least.

Well, I hit it a couple of times while testing patches. This is all
from the logs:

[   65.121109] PM: Basic memory bitmaps freed
[   65.125991] Restarting tasks ... 
[   65.129342] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   65.129585] done.
[   65.141314] BUG: unable to handle kernel paging request at ffff88042b957e40
[   65.141316] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[   65.141318] PGD 2067067 PUD 206a067 PMD 800000042b8001e3 
[   65.141319] Oops: 0011 [#1] PREEMPT SMP
[   65.141327] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod amdkfd kvm_amd kvm amd_iommu_v2 irqbypass crc32_pclmul radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq
[   65.141328] CPU: 6 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #4
[   65.141329] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   65.141329] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[   65.141331] RIP: 0010:[<ffff88042b957e40>]  [<ffff88042b957e40>] 0xffff88042b957e40
[   65.141331] RSP: 0018:ffff88042b957e00  EFLAGS: 00010282
[   65.141332] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[   65.141333] RDX: 0000000000000001 RSI: ffffffff81063b59 RDI: ffffffff8168898c
[   65.141333] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[   65.141334] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88042b954000
[   65.141334] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[   65.141335] FS:  00007fad32173800(0000) GS:ffff88043dd80000(0000) knlGS:0000000000000000
[   65.141336] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   65.141336] CR2: ffff88042b957e40 CR3: 00000004298e6000 CR4: 00000000000406e0
[   65.141336] Stack:
[   65.141338]  ffff880037b81000 ffff880037b81000 0000000000000000 ffffffff81181e1e
[   65.141339]  ffffff9c00000002 ffff880429e8c600 ffffffff811782bf 0000000000000011
[   65.141340]  000000000000049c 0000000000000001 0000000000001180 0000000000000000
[   65.141340] Call Trace:
[   65.141344]  [<ffffffff81181e1e>] ? getname_flags+0x5e/0x1b0
[   65.141346]  [<ffffffff811782bf>] ? cp_new_stat+0x10f/0x120
[   65.141348]  [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0
[   65.141353]  [<ffffffff81185fc7>] ? poll_select_copy_remaining+0xe7/0x130
[   65.141355]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[   65.141356]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[   65.141358]  [<ffffffff81688e72>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[   65.141374] Code: 00 00 00 1e 1e 18 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 c6 e8 29 04 88 ff ff bf 82 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <9c> 04 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00 
[   65.141375] RIP  [<ffff88042b957e40>] 0xffff88042b957e40
[   65.141376]  RSP <ffff88042b957e00>
[   65.141376] CR2: ffff88042b957e40
[   65.141378] ---[ end trace 5dc71ecf8d888ee6 ]---
[   65.141509] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   65.141509] 
[   65.149191] Kernel Offset: disabled
[   65.449314] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

...

[  381.835297] Restarting tasks ... 
[  381.838620] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[  381.838689] done.
[  381.850763] BUG: unable to handle kernel paging request at ffff88042b957e40
[  381.850765] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[  381.850766] PGD 2065067 PUD 2068067 PMD 800000042b8001e3 
[  381.850767] Oops: 0011 [#1] PREEMPT SMP
[  381.850778] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod amdkfd kvm_amd kvm amd_iommu_v2 radeon irqbypass crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd k10temp fam15h_power acpi_cpufreq
[  381.850779] CPU: 3 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #1
[  381.850780] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[  381.850781] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[  381.850782] RIP: 0010:[<ffff88042b957e40>]  [<ffff88042b957e40>] 0xffff88042b957e40
[  381.850783] RSP: 0018:ffff88042b957e00  EFLAGS: 00010282
[  381.850783] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[  381.850784] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[  381.850784] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[  381.850785] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[  381.850785] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[  381.850786] FS:  00007f1143649800(0000) GS:ffff88043dcc0000(0000) knlGS:0000000000000000
[  381.850787] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  381.850787] CR2: ffff88042b957e40 CR3: 00000004298af000 CR4: 00000000000406e0
[  381.850788] Stack:
[  381.850789]  ffff88042b1ed000 ffff88042b1ed000 0000000000000000 ffffffff8117f8ae
[  381.850790]  ffffff9c00000002 ffff88042b09ac00 ffffffff81175d5f 0000000000000011
[  381.850791]  0000000000001c3d 0000000000000001 0000000000001180 0000000000000000
[  381.850792] Call Trace:
[  381.850795]  [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
[  381.850797]  [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
[  381.850799]  [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
[  381.850800]  [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
[  381.850802]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[  381.850804]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[  381.850806]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[  381.850820] Code: 00 00 00 ae f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 ac 09 2b 04 88 ff ff 5f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <3d> 1c 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00 
[  381.850821] RIP  [<ffff88042b957e40>] 0xffff88042b957e40
[  381.850821]  RSP <ffff88042b957e00>
[  381.850821] CR2: ffff88042b957e40
[  381.850824] ---[ end trace b4f9b4244a59d886 ]---
[  381.851025] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

...

[   49.003526] Restarting tasks ... 
[   49.007083] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   49.007237] done.
[   49.022621] BUG: unable to handle kernel paging request at ffff88042b957e40
[   49.022624] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[   49.022627] PGD 2065067 PUD 2068067 PMD 800000042b8001e3 
[   49.022629] Oops: 0011 [#1] PREEMPT SMP
[   49.022642] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm amdkfd irqbypass crc32_pclmul amd_iommu_v2 radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd k10temp fam15h_power acpi_cpufreq
[   49.022645] CPU: 4 PID: 1 Comm: init Not tainted 4.7.0-rc3+ #2
[   49.022646] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   49.022648] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[   49.022650] RIP: 0010:[<ffff88042b957e40>]  [<ffff88042b957e40>] 0xffff88042b957e40
[   49.022652] RSP: 0018:ffff88042b957e00  EFLAGS: 00010282
[   49.022653] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[   49.022654] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[   49.022655] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[   49.022657] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[   49.022658] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[   49.022660] FS:  00007fe2cd5dc800(0000) GS:ffff88043dd00000(0000) knlGS:0000000000000000
[   49.022661] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   49.022662] CR2: ffff88042b957e40 CR3: 0000000429edd000 CR4: 00000000000406e0
[   49.022663] Stack:
[   49.022666]  ffff88042aca7000 ffff88042aca7000 0000000000000000 ffffffff8117f8ae
[   49.022668]  ffffff9c00000002 ffff880429e6e000 ffffffff81175d5f 0000000000000011
[   49.022674]  0000000000001c49 0000000000000001 0000000000001180 0000000000000000
[   49.022675] Call Trace:
[   49.022680]  [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
[   49.022683]  [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
[   49.022686]  [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
[   49.022689]  [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
[   49.022692]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[   49.022695]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[   49.022698]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[   49.022725] Code: 00 00 00 ae f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 00 e0 e6 29 04 88 ff ff 5f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <49> 1c 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00 
[   49.022727] RIP  [<ffff88042b957e40>] 0xffff88042b957e40
[   49.022728]  RSP <ffff88042b957e00>
[   49.022729] CR2: ffff88042b957e40
[   49.022732] ---[ end trace 6694c76b6124dda9 ]---
[   49.022911] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   49.022911] 
[   49.030807] Kernel Offset: disabled
[   49.348267] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

...

[   39.616661] PM: Basic memory bitmaps freed
[   39.621491] Restarting tasks ... 
[   39.624829] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[   39.624908] done.
[   39.636878] BUG: unable to handle kernel paging request at ffff88042b957e40
[   39.636880] IP: [<ffff88042b957e40>] 0xffff88042b957e40
[   39.636882] PGD 2065067 PUD 2068067 PMD 800000042b8001e3 
[   39.636883] Oops: 0011 [#1] PREEMPT SMP
[   39.636890] Modules linked in: binfmt_misc ipv6 vfat fat amd64_edac_mod edac_mce_amd fuse dm_crypt dm_mod kvm_amd kvm irqbypass crc32_pclmul amdkfd amd_iommu_v2 radeon aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd fam15h_power k10temp acpi_cpufreq
[   39.636892] CPU: 6 PID: 1 Comm: init Not tainted 4.7.0-rc4+ #1
[   39.636893] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[   39.636894] task: ffff88042b958000 ti: ffff88042b954000 task.ti: ffff88042b954000
[   39.636895] RIP: 0010:[<ffff88042b957e40>]  [<ffff88042b957e40>] 0xffff88042b957e40
[   39.636895] RSP: 0018:ffff88042b957e00  EFLAGS: 00010282
[   39.636896] RAX: 0000000000000000 RBX: ffff88042b957f58 RCX: 0000000000000000
[   39.636897] RDX: 0000000000000001 RSI: ffffffff81062a2d RDI: ffffffff81687d8c
[   39.636897] RBP: ffff88042b957ef0 R08: 0000000000000000 R09: 0000000000000002
[   39.636898] R10: 00000000ffffffff R11: 0000000000000001 R12: ffff88042b954000
[   39.636898] R13: ffff88042b954000 R14: ffff88042b957f58 R15: ffff88042b958000
[   39.636899] FS:  00007f45944a4800(0000) GS:ffff88043dd80000(0000) knlGS:0000000000000000
[   39.636900] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   39.636900] CR2: ffff88042b957e40 CR3: 0000000429015000 CR4: 00000000000406e0
[   39.636901] Stack:
[   39.636902]  ffff8800b9ec5000 ffff8800b9ec5000 0000000000000000 ffffffff8117f8be
[   39.636903]  ffffff9c00000002 ffff88042ae8aa80 ffffffff81175d6f 0000000000000011
[   39.636904]  000000000000284c 0000000000000001 0000000000001180 0000000000000000
[   39.636905] Call Trace:
[   39.636908]  [<ffffffff8117f8be>] ? getname_flags+0x5e/0x1b0
[   39.636910]  [<ffffffff81175d6f>] ? cp_new_stat+0x10f/0x120
[   39.636912]  [<ffffffff810b9eaa>] ? ktime_get_ts64+0x4a/0xf0
[   39.636917]  [<ffffffff81183a67>] ? poll_select_copy_remaining+0xe7/0x130
[   39.636919]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
[   39.636921]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
[   39.636922]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7
[   39.636939] Code: 00 00 00 be f8 17 81 ff ff ff ff 02 00 00 00 9c ff ff ff 80 aa e8 2a 04 88 ff ff 6f 5d 17 81 ff ff ff ff 11 00 00 00 00 00 00 00 <4c> 28 00 00 00 00 00 00 01 00 00 00 00 00 00 00 80 11 00 00 00 
[   39.636939] RIP  [<ffff88042b957e40>] 0xffff88042b957e40
[   39.636940]  RSP <ffff88042b957e00>
[   39.636940] CR2: ffff88042b957e40
[   39.636943] ---[ end trace 7b732e7484eb8577 ]---
[   39.637066] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
[   39.637066] 
[   39.644839] Kernel Offset: disabled
[   39.944295] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

...



-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  9:45                                     ` Borislav Petkov
@ 2016-06-30 11:27                                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 11:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rafael J. Wysocki, Logan Gunthorpe, Kees Cook, Linus Torvalds,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 11:45 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jun 30, 2016 at 04:20:43AM +0200, Rafael J. Wysocki wrote:
>> That's not what Boris was seeing at least.
>
> Well, I had it a couple of times during testing patches. This is all
> from the logs:
>
> [   65.121109] PM: Basic memory bitmaps freed
> [   65.125991] Restarting tasks ...
> [   65.129342] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [   65.129585] done.
> [   65.141314] BUG: unable to handle kernel paging request at ffff88042b957e40

I mean the failure mode, not the particular exception type. :-)

You always saw it in a user space task after kernel resume:

> [   65.141340] Call Trace:
> [   65.141344]  [<ffffffff81181e1e>] ? getname_flags+0x5e/0x1b0
> [   65.141346]  [<ffffffff811782bf>] ? cp_new_stat+0x10f/0x120
> [   65.141348]  [<ffffffff810bb33a>] ? ktime_get_ts64+0x4a/0xf0
> [   65.141353]  [<ffffffff81185fc7>] ? poll_select_copy_remaining+0xe7/0x130
> [   65.141355]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [   65.141356]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [   65.141358]  [<ffffffff81688e72>] entry_SYSCALL_64_fastpath+0xa5/0xa7

[cut]

> [  381.850792] Call Trace:
> [  381.850795]  [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
> [  381.850797]  [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
> [  381.850799]  [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
> [  381.850800]  [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
> [  381.850802]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [  381.850804]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [  381.850806]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7

[cut]

> [   49.022675] Call Trace:
> [   49.022680]  [<ffffffff8117f8ae>] ? getname_flags+0x5e/0x1b0
> [   49.022683]  [<ffffffff81175d5f>] ? cp_new_stat+0x10f/0x120
> [   49.022686]  [<ffffffff810b9eca>] ? ktime_get_ts64+0x4a/0xf0
> [   49.022689]  [<ffffffff81183a57>] ? poll_select_copy_remaining+0xe7/0x130
> [   49.022692]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [   49.022695]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [   49.022698]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7

[cut]

> [   39.636905] Call Trace:
> [   39.636908]  [<ffffffff8117f8be>] ? getname_flags+0x5e/0x1b0
> [   39.636910]  [<ffffffff81175d6f>] ? cp_new_stat+0x10f/0x120
> [   39.636912]  [<ffffffff810b9eaa>] ? ktime_get_ts64+0x4a/0xf0
> [   39.636917]  [<ffffffff81183a67>] ? poll_select_copy_remaining+0xe7/0x130
> [   39.636919]  [<ffffffff8100263a>] exit_to_usermode_loop+0x8a/0xb0
> [   39.636921]  [<ffffffff81002a6b>] syscall_return_slowpath+0x5b/0x70
> [   39.636922]  [<ffffffff81688272>] entry_SYSCALL_64_fastpath+0xa5/0xa7

which is a clear indication of image corruption during restore.

In Logan's case this happens in swsusp_arch_resume() proper, and the
address in RIP is relative to the identity mapping, so the only place
it can happen is the jump to relocated_restore_code.  That's because
before that jump the addresses in RIP are relative to the kernel text
mapping, and after it we immediately switch over to the temporary
page tables, which are all executable.  So that is the only place,
AFAICS.

Also, in your case the failure was 100% reproducible, while in
Logan's case it has happened only once so far (so apparently it
happens once in a blue moon).

In summary, I'm sure that this is a different issue.

Thanks,
Rafael

* Re: [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30  3:56                                       ` Logan Gunthorpe
@ 2016-06-30 12:16                                         ` Rafael J. Wysocki
  0 siblings, 0 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 12:16 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Rafael J. Wysocki, Kees Cook, Borislav Petkov, Linus Torvalds,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 5:56 AM, Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
> On 29/06/16 08:55 PM, Rafael J. Wysocki wrote:
>
>> The only thing that comes to mind at this point is that TLBs should be
>> flushed
>> after page tables changes, so please apply the appended and let me know
>> if you see this panic any more with it.
>
>
>
> Ok, I'll build a new kernel tomorrow.

Well, please wait for a while.

I'm looking at the panic log right now and the picture is a bit clearer now.

> But keep in mind the panic is pretty
> rare as I've only seen it once so far after a couple dozen or so hibernates.
> So it may be hard to get a concrete yes or no on whether it fixes the issue.

Right.

> I've got a script to run a bunch of hibernates in a row. I usually only run
> it for a handful of iterations, but I'll try running it for much longer with
> this patch and let you know in a couple days.

As I said, please wait a bit, there may be updates in the meantime. :-)

From looking at your panic log, the exception happened in
swsusp_arch_resume(), which probably covers restore_image() too,
because the latter is likely to be inlined into swsusp_arch_resume().

Next, the address in RIP (a) clearly is a page start and (b) is
relative to the identity mapping, so it most likely is the address
from relocated_restore_code.  Moreover, the RCX value is the same
address and the values in the other registers also match exactly the
situation before the jump to relocated_restore_code.  Thus the
exception was triggered by that jump beyond doubt.

Now, if you look above the Oops line, it becomes quite clear what
happened.  Namely, dump_pagetable() (arch/x86/mm/fault.c, line 524 in
4.6) prints the PGD, the PUD, the PMD and the PTE in that order,
omitting the lower levels (PMD, PTE) when they are not present.  In
your panic log, only the PGD and PUD are present, meaning that this
is a 1G page, and sure enough it has the NX bit set.

This case was clearly overlooked by relocate_restore_code(), so an
updated patch will follow.
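
Roughly, the NX clearing needs to stop at whatever level the mapping
actually terminates instead of assuming 4K pages.  A minimal sketch
of the idea (the patch that follows is what counts):

	pgd_t *pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
	pud_t *pud = pud_offset(pgd, relocated_restore_code);

	if (pud_large(*pud)) {
		/* 1G page: clear NX right in the PUD entry */
		set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
	} else {
		pmd_t *pmd = pmd_offset(pud, relocated_restore_code);

		if (pmd_large(*pmd)) {
			/* 2M page: clear NX in the PMD entry */
			set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
		} else {
			/* 4K page: clear NX in the PTE */
			pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);

			set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
		}
	}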

Thanks,
Rafael

* [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
  2016-06-27 20:08                             ` Borislav Petkov
  2016-06-27 23:33                             ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
@ 2016-06-30 13:17                             ` Rafael J. Wysocki
  2016-06-30 15:05                               ` Borislav Petkov
  2016-06-30 16:11                               ` [PATCH v5] " Rafael J. Wysocki
  2 siblings, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 13:17 UTC (permalink / raw)
  To: Logan Gunthorpe, Borislav Petkov
  Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86: the temporary page tables it
sets up, to avoid corrupting the page tables while the last bits of
the image kernel's memory contents are copied into their original
page frames, re-use the boot kernel's text mapping, but that mapping
may very well get corrupted just like any other part of the page
tables.  Of course, if that happens, the final jump to the image
kernel's entry point will go nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames previously occupied
by them doesn't use the kernel text mapping, because it runs from a
special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run, and after that point it will only be used
for the final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v3 -> v4: The new relocate_restore_code() didn't cover the case when
  the page containing the relocated code was mapped via a huge (1G)
  page and it didn't clear the NX bit in that case, which led to the
  page fault reported by Logan at
  http://marc.info/?l=linux-kernel&m=146725158621626&w=4.
  Fix that by making relocate_restore_code() handle the huge page
  mapping case too.

I've retained the Tested-by: from Kees as this change has nothing to do with
whether or not KASLR works with hibernation after this patch.

Logan, please test this one thoroughly.  I'll give it at least one week in
linux-next going forward.

Boris, please test it on the machine where we saw memory corruption
with the previous versions, if possible.

Thanks,
Rafael

---
 arch/x86/power/hibernate_64.c     |   97 +++++++++++++++++++++++++++++++++-----
 arch/x86/power/hibernate_asm_64.S |   55 +++++++++------------
 2 files changed, 109 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
 #include <asm/mtrr.h>
 #include <asm/sections.h>
 #include <asm/suspend.h>
+#include <asm/tlbflush.h>
 
 /* Defined in hibernate_asm_64.S */
 extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
 
 pgd_t *temp_level4_pgt __visible;
 
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(restore_jump_address),
+		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(restore_jump_address),
+		__pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
 
 static void *alloc_pgt_page(void *context)
 {
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,50 @@ static int set_up_temporary_mappings(voi
 	return 0;
 }
 
+static int relocate_restore_code(void)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+
+	relocated_restore_code = get_safe_page(GFP_ATOMIC);
+	if (!relocated_restore_code)
+		return -ENOMEM;
+
+	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+	/* Make the page containing the relocated code executable */
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+	pud = pud_offset(pgd, relocated_restore_code);
+	if (pud_large(*pud)) {
+		set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
+	} else {
+		pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
+
+		if (pmd_large(*pmd)) {
+			set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+		} else {
+			pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+			set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+		}
+	}
+	flush_tlb_all();
+
+	return 0;
+}
+
 int swsusp_arch_resume(void)
 {
 	int error;
 
 	/* We have got enough memory and from now on we cannot recover */
-	if ((error = set_up_temporary_mappings()))
+	error = set_up_temporary_mappings();
+	if (error)
 		return error;
 
-	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
-	if (!relocated_restore_code)
-		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	error = relocate_restore_code();
+	if (error)
+		return error;
 
 	restore_image();
 	return 0;
@@ -109,11 +179,12 @@ int pfn_is_nosave(unsigned long pfn)
 
 struct restore_data_record {
 	unsigned long jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +197,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = (unsigned long)&restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +214,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
-	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
-	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
-	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
-	movq	%cr3, %rcx;  # flush TLB
-	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
 	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to switch to temporary page tables */
+	movq	temp_level4_pgt(%rip), %rax
+	movq	mmu_cr4_features(%rip), %rbx
 
 	/* prepare to copy image data to their original locations */
 	movq	restore_pblist(%rip), %rdx
+
+	/* jump to relocated restore code */
 	movq	relocated_restore_code(%rip), %rcx
 	jmpq	*%rcx
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
+	/* switch to temporary page tables */
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
+	movq	%rax, %cr3
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
+	movq	%cr3, %rcx;  # flush TLB
+	movq	%rcx, %cr3;
+	movq	%rbx, %cr4;  # turn PGE back on
 .Lloop:
 	testq	%rdx, %rdx
 	jz	.Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax

* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 13:17                             ` [PATCH v4] " Rafael J. Wysocki
@ 2016-06-30 15:05                               ` Borislav Petkov
  2016-06-30 15:17                                 ` Rafael J. Wysocki
  2016-06-30 16:11                               ` [PATCH v5] " Rafael J. Wysocki
  1 sibling, 1 reply; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30 15:05 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).

...

> +static int relocate_restore_code(void)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +
> +	relocated_restore_code = get_safe_page(GFP_ATOMIC);
> +	if (!relocated_restore_code)
> +		return -ENOMEM;
> +
> +	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
> +
> +	/* Make the page containing the relocated code executable */
> +	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
> +	pud = pud_offset(pgd, relocated_restore_code);
> +	if (pud_large(*pud)) {
> +		set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
> +	} else {
> +		pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
> +
> +		if (pmd_large(*pmd)) {
> +			set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
> +		} else {
> +			pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
> +
> +			set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
> +		}
> +	}
> +	flush_tlb_all();

I know you want to flush TLBs but this causes the splat below on the
resume kernel.

Most likely because:

resume_target_kernel() does local_irq_disable() and then

swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()

and smp_call_function_many() doesn't like it when IRQs are disabled.

[    7.613645] Disabling non-boot CPUs ...
[    7.902408] ------------[ cut here ]------------
[    7.907106] WARNING: CPU: 0 PID: 1 at kernel/smp.c:416 smp_call_function_many+0xb6/0x260
[    7.915319] Modules linked in:
[    7.918501] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-rc5+ #11
[    7.924931] Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A97 EVO R2.0, BIOS 1503 01/16/2013
[    7.934967]  0000000000000000 ffff88042b957cf8 ffffffff812ac1c3 0000000000000000
[    7.942664]  0000000000000000 ffff88042b957d38 ffffffff8105435d 000001a02b957d28
[    7.950369]  0000000000000000 0000000000000000 ffffffff8104d420 0000000000000000
[    7.958072] Call Trace:
[    7.960598]  [<ffffffff812ac1c3>] dump_stack+0x67/0x94
[    7.965815]  [<ffffffff8105435d>] __warn+0xdd/0x100
[    7.970771]  [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[    7.975981]  [<ffffffff8105444d>] warn_slowpath_null+0x1d/0x20
[    7.981891]  [<ffffffff810cb526>] smp_call_function_many+0xb6/0x260
[    7.988236]  [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[    7.993452]  [<ffffffff810cb716>] smp_call_function+0x46/0x80
[    7.999277]  [<ffffffff8104d420>] ? leave_mm+0xc0/0xc0
[    8.004494]  [<ffffffff810cb78e>] on_each_cpu+0x3e/0xa0
[    8.009790]  [<ffffffff81098e00>] ? hibernation_restore+0x130/0x130
[    8.016135]  [<ffffffff8104debc>] flush_tlb_all+0x1c/0x20
[    8.021613]  [<ffffffff815bd8d4>] swsusp_arch_resume+0x254/0x2b0
[    8.027696]  [<ffffffff815bd660>] ? restore_processor_state+0x2f0/0x2f0
[    8.034387]  [<ffffffff81098d9d>] hibernation_restore+0xcd/0x130
[    8.040464]  [<ffffffff81112fbd>] software_resume.part.6+0x1f9/0x25b
[    8.046894]  [<ffffffff81098e26>] software_resume+0x26/0x30
[    8.052545]  [<ffffffff81000449>] do_one_initcall+0x59/0x190
[    8.058282]  [<ffffffff81071b3c>] ? parse_args+0x26c/0x3f0
[    8.063867]  [<ffffffff8168b000>] ? _raw_read_unlock_irqrestore+0x30/0x60
[    8.070730]  [<ffffffff81cd5002>] kernel_init_freeable+0x118/0x19e
[    8.076986]  [<ffffffff816851ae>] kernel_init+0xe/0x100
[    8.082290]  [<ffffffff8168b75f>] ret_from_fork+0x1f/0x40
[    8.087768]  [<ffffffff816851a0>] ? rest_init+0x90/0x90
[    8.093073] ---[ end trace 6361ce069253f25c ]---

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 15:05                               ` Borislav Petkov
@ 2016-06-30 15:17                                 ` Rafael J. Wysocki
  2016-06-30 15:24                                   ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 15:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Rafael J. Wysocki, Logan Gunthorpe, Kees Cook, Linus Torvalds,
	Rafael J. Wysocki, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Logan Gunthorpe reports that hibernation stopped working reliably for
>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>> and rodata).
>
> ...
>
>> +static int relocate_restore_code(void)
>> +{
>> +     pgd_t *pgd;
>> +     pud_t *pud;
>> +
>> +     relocated_restore_code = get_safe_page(GFP_ATOMIC);
>> +     if (!relocated_restore_code)
>> +             return -ENOMEM;
>> +
>> +     memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>> +
>> +     /* Make the page containing the relocated code executable */
>> +     pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>> +     pud = pud_offset(pgd, relocated_restore_code);
>> +     if (pud_large(*pud)) {
>> +             set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>> +     } else {
>> +             pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>> +
>> +             if (pmd_large(*pmd)) {
>> +                     set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>> +             } else {
>> +                     pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>> +
>> +                     set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>> +             }
>> +     }
>> +     flush_tlb_all();
>
> I know you want to flush TLBs but this causes the splat below on the
> resume kernel.
>
> Most likely because:
>
> resume_target_kernel() does local_irq_disable() and then
>
> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>
> and smp_call_function_many() doesn't like it when IRQs are disabled.

Right.

Can I invoke __flush_tlb_all() from here to flush the TLB on the local
CPU?  That should be sufficient IMO.
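
On the local CPU that boils down to roughly the following (a sketch
only; the real __flush_tlb_all() checks the PGE CPU feature rather
than reading CR4 directly):

	unsigned long cr4 = native_read_cr4();

	if (cr4 & X86_CR4_PGE) {
		/* toggle CR4.PGE to flush global TLB entries too */
		native_write_cr4(cr4 & ~X86_CR4_PGE);
		native_write_cr4(cr4);
	} else {
		/* reload CR3 to flush the non-global entries */
		native_write_cr3(native_read_cr3());
	}

so no cross-CPU calls are involved.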

Thanks,
Rafael

* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 15:17                                 ` Rafael J. Wysocki
@ 2016-06-30 15:24                                   ` Andy Lutomirski
  2016-06-30 15:29                                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2016-06-30 15:24 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, Rafael J. Wysocki, Logan Gunthorpe, Kees Cook,
	Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>> and rodata).
>>
>> ...
>>
>>> +static int relocate_restore_code(void)
>>> +{
>>> +     pgd_t *pgd;
>>> +     pud_t *pud;
>>> +
>>> +     relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>> +     if (!relocated_restore_code)
>>> +             return -ENOMEM;
>>> +
>>> +     memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>> +
>>> +     /* Make the page containing the relocated code executable */
>>> +     pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>> +     pud = pud_offset(pgd, relocated_restore_code);
>>> +     if (pud_large(*pud)) {
>>> +             set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>> +     } else {
>>> +             pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>> +
>>> +             if (pmd_large(*pmd)) {
>>> +                     set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>> +             } else {
>>> +                     pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>> +
>>> +                     set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>> +             }
>>> +     }
>>> +     flush_tlb_all();
>>
>> I know you want to flush TLBs but this causes the splat below on the
>> resume kernel.
>>
>> Most likely because:
>>
>> resume_target_kernel() does local_irq_disable() and then
>>
>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>
>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>
> Right.
>
> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
> CPU?  That should be sufficient IMO.

Yes, unless there's another CPU already up at the time, but that seems unlikely.

FWIW, the CPU can and will cache the NX bit, and it will *not*
refresh the TLB before raising a page fault, so you definitely need
to flush after clearing NX (sketch below).  I wonder if this is also
why you're seeing a certain amount of random corruption.
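
Something like this, with the names from the v4 patch:

	set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));	/* clear NX first */
	__flush_tlb_all();	/* then drop stale TLB entries that still carry NX */
	/* only after the flush is it safe to execute from that page */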

--Andy

* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 15:24                                   ` Andy Lutomirski
@ 2016-06-30 15:29                                     ` Rafael J. Wysocki
  2016-06-30 17:23                                       ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 15:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rafael J. Wysocki, Borislav Petkov, Rafael J. Wysocki,
	Logan Gunthorpe, Kees Cook, Linus Torvalds, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>>> and rodata).
>>>
>>> ...
>>>
>>>> +static int relocate_restore_code(void)
>>>> +{
>>>> +     pgd_t *pgd;
>>>> +     pud_t *pud;
>>>> +
>>>> +     relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>>> +     if (!relocated_restore_code)
>>>> +             return -ENOMEM;
>>>> +
>>>> +     memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>>> +
>>>> +     /* Make the page containing the relocated code executable */
>>>> +     pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>>> +     pud = pud_offset(pgd, relocated_restore_code);
>>>> +     if (pud_large(*pud)) {
>>>> +             set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>>> +     } else {
>>>> +             pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>>> +
>>>> +             if (pmd_large(*pmd)) {
>>>> +                     set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>>> +             } else {
>>>> +                     pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>>> +
>>>> +                     set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>>> +             }
>>>> +     }
>>>> +     flush_tlb_all();
>>>
>>> I know you want to flush TLBs but this causes the splat below on the
>>> resume kernel.
>>>
>>> Most likely because:
>>>
>>> resume_target_kernel() does local_irq_disable() and then
>>>
>>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>>
>>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>>
>> Right.
>>
>> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
>> CPU?  That should be sufficient IMO.
>
> Yes, unless there's another CPU already up at the time, but that seems unlikely.

They are all offline at that time.

OK, let me do that then.

> FWIW, the CPU can and will cache the NX bit and the CPU will *not*
> refresh the TLB before sending a page fault.  You definitely need to
> flush.

OK

>  I wonder if this is also why you're seeing a certain amount of
> random corruption.

Well, the corruption we saw was not really random.  I can give you
all of the details if you're interested; maybe you'll be able to
figure out what might be happening. :-)

Thanks,
Rafael

* [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 13:17                             ` [PATCH v4] " Rafael J. Wysocki
  2016-06-30 15:05                               ` Borislav Petkov
@ 2016-06-30 16:11                               ` Rafael J. Wysocki
  2016-06-30 17:02                                 ` Borislav Petkov
  2016-06-30 21:47                                 ` Logan Gunthorpe
  1 sibling, 2 replies; 39+ messages in thread
From: Rafael J. Wysocki @ 2016-06-30 16:11 UTC (permalink / raw)
  To: Logan Gunthorpe, Borislav Petkov
  Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Logan Gunthorpe reports that hibernation stopped working reliably for
him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
and rodata).

That turns out to be a consequence of a long-standing issue with the
64-bit image restoration code on x86: the temporary page tables it
sets up, to avoid corrupting the page tables while the last bits of
the image kernel's memory contents are copied into their original
page frames, re-use the boot kernel's text mapping, but that mapping
may very well get corrupted just like any other part of the page
tables.  Of course, if that happens, the final jump to the image
kernel's entry point will go nowhere.

The exact reason why commit ab76f7b4ab23 matters here is that it
sometimes causes a PMD of a large page to be split into PTEs
that are allocated dynamically and get corrupted during image
restoration as described above.

To fix that issue, note that the code copying the last bits of the
image kernel's memory contents to the page frames previously occupied
by them doesn't use the kernel text mapping, because it runs from a
special page covered by the identity mapping set up for that code
from scratch.  Hence, the kernel text mapping is only needed before
that code starts to run, and after that point it will only be used
for the final jump to the image kernel's entry point.

Accordingly, the temporary page tables set up in swsusp_arch_resume()
on x86-64 need to contain the kernel text mapping too.  That mapping
is only going to be used for the final jump to the image kernel, so
it only needs to cover the image kernel's entry point, because the
first thing the image kernel does after getting control back is to
switch over to its own original page tables.  Moreover, the virtual
address of the image kernel's entry point in that mapping has to be
the same as the one mapped by the image kernel's page tables.

With that in mind, modify the x86-64's arch_hibernation_header_save()
and arch_hibernation_header_restore() routines to pass the physical
address of the image kernel's entry point (in addition to its virtual
address) to the boot kernel (a small piece of assembly code involved
in passing the entry point's virtual address to the image kernel is
not necessary any more after that, so drop it).  Update RESTORE_MAGIC
too to reflect the image header format change.

Next, in set_up_temporary_mappings(), use the physical and virtual
addresses of the image kernel's entry point passed in the image
header to set up a minimum kernel text mapping (using memory pages
that won't be overwritten by the image kernel's memory contents) that
will map those addresses to each other as appropriate.

This makes the concern about the possible corruption of the original
boot kernel text mapping go away, and if the minimum kernel text
mapping used for the final jump marks the image kernel's entry point
memory as executable, the jump to it is guaranteed to succeed.

Fixes: ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table and rodata)
Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v4 -> v5: Replace flush_tlb_all() in relocate_restore_code() with
  __flush_tlb_all() (which should still be sufficient) to avoid
  complaints from smp_call_function_many() about disabled interrupts.

v3 -> v4: The new relocate_restore_code() didn't cover the case when
  the page containing the relocated code was mapped via a huge (1G)
  page and it didn't clear the NX bit in that case, which led to the
  page fault reported by Logan at
  http://marc.info/?l=linux-kernel&m=146725158621626&w=4.
  Fix that by making relocate_restore_code() handle the huge page
  mapping case too.

I've retained the Tested-by: from Kees as these changes have nothing to do with
whether or not KASLR works with hibernation after this patch.

Logan, please test this one thoroughly.  I'll give it at least one week in
linux-next going forward.

Boris, please test it on the machine where we saw memory corruption
with the previous versions, if possible.

---
 arch/x86/power/hibernate_64.c     |   97 +++++++++++++++++++++++++++++++++-----
 arch/x86/power/hibernate_asm_64.S |   55 +++++++++------------
 2 files changed, 109 insertions(+), 43 deletions(-)

Index: linux-pm/arch/x86/power/hibernate_64.c
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_64.c
+++ linux-pm/arch/x86/power/hibernate_64.c
@@ -19,6 +19,7 @@
 #include <asm/mtrr.h>
 #include <asm/sections.h>
 #include <asm/suspend.h>
+#include <asm/tlbflush.h>
 
 /* Defined in hibernate_asm_64.S */
 extern asmlinkage __visible int restore_image(void);
@@ -28,6 +29,7 @@ extern asmlinkage __visible int restore_
  * kernel's text (this value is passed in the image header).
  */
 unsigned long restore_jump_address __visible;
+unsigned long jump_address_phys;
 
 /*
  * Value of the cr3 register from before the hibernation (this value is passed
@@ -37,7 +39,43 @@ unsigned long restore_cr3 __visible;
 
 pgd_t *temp_level4_pgt __visible;
 
-void *relocated_restore_code __visible;
+unsigned long relocated_restore_code __visible;
+
+static int set_up_temporary_text_mapping(void)
+{
+	pmd_t *pmd;
+	pud_t *pud;
+
+	/*
+	 * The new mapping only has to cover the page containing the image
+	 * kernel's entry point (jump_address_phys), because the switch over to
+	 * it is carried out by relocated code running from a page allocated
+	 * specifically for this purpose and covered by the identity mapping, so
+	 * the temporary kernel text mapping is only needed for the final jump.
+	 * Moreover, in that mapping the virtual address of the image kernel's
+	 * entry point must be the same as its virtual address in the image
+	 * kernel (restore_jump_address), so the image kernel's
+	 * restore_registers() code doesn't find itself in a different area of
+	 * the virtual address space after switching over to the original page
+	 * tables used by the image kernel.
+	 */
+	pud = (pud_t *)get_safe_page(GFP_ATOMIC);
+	if (!pud)
+		return -ENOMEM;
+
+	pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
+	if (!pmd)
+		return -ENOMEM;
+
+	set_pmd(pmd + pmd_index(restore_jump_address),
+		__pmd((jump_address_phys & PMD_MASK) | __PAGE_KERNEL_LARGE_EXEC));
+	set_pud(pud + pud_index(restore_jump_address),
+		__pud(__pa(pmd) | _KERNPG_TABLE));
+	set_pgd(temp_level4_pgt + pgd_index(restore_jump_address),
+		__pgd(__pa(pud) | _KERNPG_TABLE));
+
+	return 0;
+}
 
 static void *alloc_pgt_page(void *context)
 {
@@ -59,9 +97,10 @@ static int set_up_temporary_mappings(voi
 	if (!temp_level4_pgt)
 		return -ENOMEM;
 
-	/* It is safe to reuse the original kernel mapping */
-	set_pgd(temp_level4_pgt + pgd_index(__START_KERNEL_map),
-		init_level4_pgt[pgd_index(__START_KERNEL_map)]);
+	/* Prepare a temporary mapping for the kernel text */
+	result = set_up_temporary_text_mapping();
+	if (result)
+		return result;
 
 	/* Set up the direct mapping from scratch */
 	for (i = 0; i < nr_pfn_mapped; i++) {
@@ -78,19 +117,50 @@ static int set_up_temporary_mappings(voi
 	return 0;
 }
 
+static int relocate_restore_code(void)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+
+	relocated_restore_code = get_safe_page(GFP_ATOMIC);
+	if (!relocated_restore_code)
+		return -ENOMEM;
+
+	memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
+
+	/* Make the page containing the relocated code executable */
+	pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
+	pud = pud_offset(pgd, relocated_restore_code);
+	if (pud_large(*pud)) {
+		set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
+	} else {
+		pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
+
+		if (pmd_large(*pmd)) {
+			set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
+		} else {
+			pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
+
+			set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
+		}
+	}
+	__flush_tlb_all();
+
+	return 0;
+}
+
 int swsusp_arch_resume(void)
 {
 	int error;
 
 	/* We have got enough memory and from now on we cannot recover */
-	if ((error = set_up_temporary_mappings()))
+	error = set_up_temporary_mappings();
+	if (error)
 		return error;
 
-	relocated_restore_code = (void *)get_safe_page(GFP_ATOMIC);
-	if (!relocated_restore_code)
-		return -ENOMEM;
-	memcpy(relocated_restore_code, &core_restore_code,
-	       &restore_registers - &core_restore_code);
+	error = relocate_restore_code();
+	if (error)
+		return error;
 
 	restore_image();
 	return 0;
@@ -109,11 +179,12 @@ int pfn_is_nosave(unsigned long pfn)
 
 struct restore_data_record {
 	unsigned long jump_address;
+	unsigned long jump_address_phys;
 	unsigned long cr3;
 	unsigned long magic;
 };
 
-#define RESTORE_MAGIC	0x0123456789ABCDEFUL
+#define RESTORE_MAGIC	0x123456789ABCDEF0UL
 
 /**
  *	arch_hibernation_header_save - populate the architecture specific part
@@ -126,7 +197,8 @@ int arch_hibernation_header_save(void *a
 
 	if (max_size < sizeof(struct restore_data_record))
 		return -EOVERFLOW;
-	rdr->jump_address = restore_jump_address;
+	rdr->jump_address = (unsigned long)&restore_registers;
+	rdr->jump_address_phys = __pa_symbol(&restore_registers);
 	rdr->cr3 = restore_cr3;
 	rdr->magic = RESTORE_MAGIC;
 	return 0;
@@ -142,6 +214,7 @@ int arch_hibernation_header_restore(void
 	struct restore_data_record *rdr = addr;
 
 	restore_jump_address = rdr->jump_address;
+	jump_address_phys = rdr->jump_address_phys;
 	restore_cr3 = rdr->cr3;
 	return (rdr->magic == RESTORE_MAGIC) ? 0 : -EINVAL;
 }
Index: linux-pm/arch/x86/power/hibernate_asm_64.S
===================================================================
--- linux-pm.orig/arch/x86/power/hibernate_asm_64.S
+++ linux-pm/arch/x86/power/hibernate_asm_64.S
@@ -44,9 +44,6 @@ ENTRY(swsusp_arch_suspend)
 	pushfq
 	popq	pt_regs_flags(%rax)
 
-	/* save the address of restore_registers */
-	movq	$restore_registers, %rax
-	movq	%rax, restore_jump_address(%rip)
 	/* save cr3 */
 	movq	%cr3, %rax
 	movq	%rax, restore_cr3(%rip)
@@ -57,31 +54,34 @@ ENTRY(swsusp_arch_suspend)
 ENDPROC(swsusp_arch_suspend)
 
 ENTRY(restore_image)
-	/* switch to temporary page tables */
-	movq	$__PAGE_OFFSET, %rdx
-	movq	temp_level4_pgt(%rip), %rax
-	subq	%rdx, %rax
-	movq	%rax, %cr3
-	/* Flush TLB */
-	movq	mmu_cr4_features(%rip), %rax
-	movq	%rax, %rdx
-	andq	$~(X86_CR4_PGE), %rdx
-	movq	%rdx, %cr4;  # turn off PGE
-	movq	%cr3, %rcx;  # flush TLB
-	movq	%rcx, %cr3;
-	movq	%rax, %cr4;  # turn PGE back on
-
 	/* prepare to jump to the image kernel */
-	movq	restore_jump_address(%rip), %rax
-	movq	restore_cr3(%rip), %rbx
+	movq	restore_jump_address(%rip), %r8
+	movq	restore_cr3(%rip), %r9
+
+	/* prepare to switch to temporary page tables */
+	movq	temp_level4_pgt(%rip), %rax
+	movq	mmu_cr4_features(%rip), %rbx
 
 	/* prepare to copy image data to their original locations */
 	movq	restore_pblist(%rip), %rdx
+
+	/* jump to relocated restore code */
 	movq	relocated_restore_code(%rip), %rcx
 	jmpq	*%rcx
 
 	/* code below has been relocated to a safe page */
 ENTRY(core_restore_code)
+	/* switch to temporary page tables */
+	movq	$__PAGE_OFFSET, %rcx
+	subq	%rcx, %rax
+	movq	%rax, %cr3
+	/* flush TLB */
+	movq	%rbx, %rcx
+	andq	$~(X86_CR4_PGE), %rcx
+	movq	%rcx, %cr4;  # turn off PGE
+	movq	%cr3, %rcx;  # flush TLB
+	movq	%rcx, %cr3;
+	movq	%rbx, %cr4;  # turn PGE back on
 .Lloop:
 	testq	%rdx, %rdx
 	jz	.Ldone
@@ -96,24 +96,17 @@ ENTRY(core_restore_code)
 	/* progress to the next pbe */
 	movq	pbe_next(%rdx), %rdx
 	jmp	.Lloop
+
 .Ldone:
 	/* jump to the restore_registers address from the image header */
-	jmpq	*%rax
-	/*
-	 * NOTE: This assumes that the boot kernel's text mapping covers the
-	 * image kernel's page containing restore_registers and the address of
-	 * this page is the same as in the image kernel's text mapping (it
-	 * should always be true, because the text mapping is linear, starting
-	 * from 0, and is supposed to cover the entire kernel text for every
-	 * kernel).
-	 *
-	 * code below belongs to the image kernel
-	 */
+	jmpq	*%r8
 
+	 /* code below belongs to the image kernel */
+	.align PAGE_SIZE
 ENTRY(restore_registers)
 	FRAME_BEGIN
 	/* go back to the original page tables */
-	movq    %rbx, %cr3
+	movq    %r9, %cr3
 
 	/* Flush TLB, including "global" things (vmalloc) */
 	movq	mmu_cr4_features(%rip), %rax

* Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 16:11                               ` [PATCH v5] " Rafael J. Wysocki
@ 2016-06-30 17:02                                 ` Borislav Petkov
  2016-06-30 21:47                                 ` Logan Gunthorpe
  1 sibling, 0 replies; 39+ messages in thread
From: Borislav Petkov @ 2016-06-30 17:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Logan Gunthorpe, Kees Cook, Linus Torvalds, Rafael J. Wysocki,
	Thomas Gleixner, Ingo Molnar, Peter Zijlstra, lkml,
	Rafael J. Wysocki, Andy Lutomirski, Brian Gerst, Denys Vlasenko,
	H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 06:11:41PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
> 
> That turns out to be a consequence of a long-standing issue with the
> 64-bit image restoration code on x86, which is that the temporary
> page tables set up by it to avoid page tables corruption when the
> last bits of the image kernel's memory contents are copied into
> their original page frames re-use the boot kernel's text mapping,
> but that mapping may very well get corrupted just like any other
> part of the page tables.  Of course, if that happens, the final
> jump to the image kernel's entry point will go to nowhere.

...

> Boris, please test it on the machine where we saw memory corruption with
> the previous versions if poss.

Looks good. 5 runs passed without a hiccup.

Reported-and-tested-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.

* Re: [PATCH v4] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 15:29                                     ` Rafael J. Wysocki
@ 2016-06-30 17:23                                       ` Andy Lutomirski
  0 siblings, 0 replies; 39+ messages in thread
From: Andy Lutomirski @ 2016-06-30 17:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Borislav Petkov, Rafael J. Wysocki, Logan Gunthorpe, Kees Cook,
	Linus Torvalds, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
	lkml, Rafael J. Wysocki, Andy Lutomirski, Brian Gerst,
	Denys Vlasenko, H. Peter Anvin, Linux PM list, Stephen Smalley

On Thu, Jun 30, 2016 at 8:29 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Jun 30, 2016 at 5:24 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Thu, Jun 30, 2016 at 8:17 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>> On Thu, Jun 30, 2016 at 5:05 PM, Borislav Petkov <bp@alien8.de> wrote:
>>>> On Thu, Jun 30, 2016 at 03:17:20PM +0200, Rafael J. Wysocki wrote:
>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>>
>>>>> Logan Gunthorpe reports that hibernation stopped working reliably for
>>>>> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
>>>>> and rodata).
>>>>
>>>> ...
>>>>
>>>>> +static int relocate_restore_code(void)
>>>>> +{
>>>>> +     pgd_t *pgd;
>>>>> +     pud_t *pud;
>>>>> +
>>>>> +     relocated_restore_code = get_safe_page(GFP_ATOMIC);
>>>>> +     if (!relocated_restore_code)
>>>>> +             return -ENOMEM;
>>>>> +
>>>>> +     memcpy((void *)relocated_restore_code, &core_restore_code, PAGE_SIZE);
>>>>> +
>>>>> +     /* Make the page containing the relocated code executable */
>>>>> +     pgd = (pgd_t *)__va(read_cr3()) + pgd_index(relocated_restore_code);
>>>>> +     pud = pud_offset(pgd, relocated_restore_code);
>>>>> +     if (pud_large(*pud)) {
>>>>> +             set_pud(pud, __pud(pud_val(*pud) & ~_PAGE_NX));
>>>>> +     } else {
>>>>> +             pmd_t *pmd = pmd_offset(pud, relocated_restore_code);
>>>>> +
>>>>> +             if (pmd_large(*pmd)) {
>>>>> +                     set_pmd(pmd, __pmd(pmd_val(*pmd) & ~_PAGE_NX));
>>>>> +             } else {
>>>>> +                     pte_t *pte = pte_offset_kernel(pmd, relocated_restore_code);
>>>>> +
>>>>> +                     set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_NX));
>>>>> +             }
>>>>> +     }
>>>>> +     flush_tlb_all();
>>>>
>>>> I know you want to flush TLBs but this causes the splat below on the
>>>> resume kernel.
>>>>
>>>> Most likely because:
>>>>
>>>> resume_target_kernel() does local_irq_disable() and then
>>>>
>>>> swsusp_arch_resume() -> relocate_restore_code() -> flush_tlb_all()
>>>>
>>>> and smp_call_function_many() doesn't like it when IRQs are disabled.
>>>
>>> Right.
>>>
>>> Can I invoke __flush_tlb_all() from here to flush the TLB on the local
>>> CPU?  That should be sufficient IMO.
>>
>> Yes, unless there's another CPU already up at the time, but that seems unlikely.
>
> They are all offline at that time.
>
> OK, let me do that then.
>
>> FWIW, the CPU can and will cache the NX bit and the CPU will *not*
>> refresh the TLB before sending a page fault.  You definitely need to
>> flush.
>
> OK
>
>>  I wonder if this is also why you're seeing a certain amount of
>> random corruption.
>
> Well, the corruption we saw was not really random.  I can give you all
> of the details if you're interested, maybe you'll be able to figure
> out what might be happening. :-)

Sure, or at least in abbreviated form.  No guarantee that I'll have
any clue, though.
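
FWIW, going back to the flush question: a minimal sketch of what the
local flush amounts to, assuming the usual CR4.PGE dance and ignoring
whatever refinements the real __flush_tlb_all() has:

	static inline void local_flush_tlb_sketch(void)
	{
		unsigned long cr4 = native_read_cr4();

		if (cr4 & X86_CR4_PGE) {
			/* Toggling CR4.PGE flushes global and non-global entries. */
			native_write_cr4(cr4 ^ X86_CR4_PGE);
			native_write_cr4(cr4);
		} else {
			/* Reloading CR3 flushes the non-global entries. */
			native_write_cr3(native_read_cr3());
		}
	}

Nothing there ever leaves the local CPU, which is fine at this point
since the nonboot CPUs have already been taken down.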

--Andy

* Re: [PATCH v5] x86/power/64: Fix kernel text mapping corruption during image restoration
  2016-06-30 16:11                               ` [PATCH v5] " Rafael J. Wysocki
  2016-06-30 17:02                                 ` Borislav Petkov
@ 2016-06-30 21:47                                 ` Logan Gunthorpe
  1 sibling, 0 replies; 39+ messages in thread
From: Logan Gunthorpe @ 2016-06-30 21:47 UTC (permalink / raw)
  To: Rafael J. Wysocki, Borislav Petkov
  Cc: Kees Cook, Linus Torvalds, Rafael J. Wysocki, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, lkml, Rafael J. Wysocki,
	Andy Lutomirski, Brian Gerst, Denys Vlasenko, H. Peter Anvin,
	Linux PM list, Stephen Smalley

Hey,

So, I've done a couple hundred hibernate/resume cycles with v5 of the
patch and so far it looks good. I'll let you know if I hit any bugs in
more typical usage over the coming week.

Thanks,

Logan

On 30/06/16 10:11 AM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Logan Gunthorpe reports that hibernation stopped working reliably for
> him after commit ab76f7b4ab23 (x86/mm: Set NX on gap between __ex_table
> and rodata).
>
> ...

end of thread

Thread overview: 39+ messages
2016-06-17 10:54 ktime_get_ts64() splat during resume Borislav Petkov
2016-06-17 11:53 ` Thomas Gleixner
2016-06-17 13:29   ` Borislav Petkov
2016-06-17 14:33     ` Borislav Petkov
2016-06-17 15:28       ` Rafael J. Wysocki
2016-06-17 16:12         ` Borislav Petkov
2016-06-17 21:03           ` Rafael J. Wysocki
2016-06-18  1:11             ` Rafael J. Wysocki
2016-06-20 14:38             ` Rafael J. Wysocki
2016-06-20 18:29               ` Linus Torvalds
2016-06-20 21:15                 ` Rafael J. Wysocki
2016-06-21  0:05                   ` Rafael J. Wysocki
2016-06-21  1:22                     ` Rafael J. Wysocki
2016-06-21  4:35                       ` Logan Gunthorpe
2016-06-21 11:36                         ` Rafael J. Wysocki
2016-06-21 18:04                         ` Kees Cook
2016-06-21 23:29                           ` Rafael J. Wysocki
2016-06-27 14:24                           ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration (was: Re: ktime_get_ts64() splat during resume) Rafael J. Wysocki
2016-06-27 20:08                             ` Borislav Petkov
2016-06-27 23:33                             ` [PATCH v3] x86/power/64: Fix kernel text mapping corruption during image restoration Logan Gunthorpe
2016-06-29 14:48                               ` Kees Cook
2016-06-30  1:52                                 ` Logan Gunthorpe
2016-06-30  2:20                                   ` Rafael J. Wysocki
2016-06-30  2:55                                     ` Rafael J. Wysocki
2016-06-30  3:56                                       ` Logan Gunthorpe
2016-06-30 12:16                                         ` Rafael J. Wysocki
2016-06-30  9:45                                     ` Borislav Petkov
2016-06-30 11:27                                       ` Rafael J. Wysocki
2016-06-30 13:17                             ` [PATCH v4] " Rafael J. Wysocki
2016-06-30 15:05                               ` Borislav Petkov
2016-06-30 15:17                                 ` Rafael J. Wysocki
2016-06-30 15:24                                   ` Andy Lutomirski
2016-06-30 15:29                                     ` Rafael J. Wysocki
2016-06-30 17:23                                       ` Andy Lutomirski
2016-06-30 16:11                               ` [PATCH v5] " Rafael J. Wysocki
2016-06-30 17:02                                 ` Borislav Petkov
2016-06-30 21:47                                 ` Logan Gunthorpe
2016-06-20  8:17         ` ktime_get_ts64() splat during resume chenyu
2016-06-20 12:21           ` Rafael J. Wysocki
