* Resume from suspend to RAM broken when using early microcode updates
@ 2018-04-11 11:48 Simon Gaiser
  2018-04-11 11:51 ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 11:48 UTC
  To: xen-devel; +Cc: Andrew Cooper, Jan Beulich

Hi,

when I use early microcode loading with the microcode update with the
BTI mitigations, resuming from suspend to RAM is broken.

Based on added logging to enter_state() (from power.c) it doesn't
survive the local_irq_restore(flags) call (at least a printk() after the
call doesn't output anything on the serial console).

I guess that some irq handler tries to use IBRS/IBPB. But the microcode
is only loaded later.

If I simply move the microcode_resume_cpu(0) directly before the
local_irq_restore(flags) everything seems to work fine. But I'm not sure
if this has unintended consequences.

I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
microcode patches from staging-4.8. AFAICS there are no commits which
change the affected code or other commits which sound relevant, so this
probably also affects all the newer branches.

Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 11:48 Resume from suspend to RAM broken when using early microcode updates Simon Gaiser
@ 2018-04-11 11:51 ` Andrew Cooper
  2018-04-11 12:01   ` Simon Gaiser
  2018-04-11 12:04   ` Jan Beulich
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Cooper @ 2018-04-11 11:51 UTC
  To: Simon Gaiser, xen-devel; +Cc: Jan Beulich

On 11/04/18 12:48, Simon Gaiser wrote:
> Hi,
>
> when I use early microcode loading with the microcode update with the
> BTI mitigations, resuming from suspend to RAM is broken.
>
> Based on added logging to enter_state() (from power.c) it doesn't
> survive the local_irq_restore(flags) call (at least a printk() after the
> call doesn't output anything on the serial console).
>
> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
> is only loaded later.
>
> If I simply move the microcode_resume_cpu(0) directly before the
> local_irq_restore(flags) everything seems to work fine. But I'm not sure
> if this has unintended consequences.
>
> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
> microcode patches from staging-4.8. AFAICS there are no commits which
> changes the affected code or other commits which sound relevant so this
> probably affected also all the newer branches.

S3 support is a very unloved area of the hypervisor.

Yes - we definitely need to get microcode reloaded before interrupts are
enabled.  That said, I would have expected a backtrace complaining about
a GP fault if we had hit the use of IBRS/IBPB before the microcode was
reloaded.

~Andrew

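The ordering constraint discussed above can be illustrated with a small
toy model. This is only a sketch, not Xen code: every name in it
(toy_cpu, toy_irq_entry(), toy_resume() and so on) is hypothetical, and
it assumes that an interrupt is pending the moment interrupts are
re-enabled and that the entry path touches an MSR which exists only once
the updated microcode is loaded.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in Xen. */
struct toy_cpu {
    bool ucode_loaded;  /* updated microcode present, so the MSR exists */
    bool irqs_enabled;
    bool wedged;        /* an entry path touched a missing MSR */
};

/* Stand-in for an interrupt entry path that writes a speculation MSR. */
static void toy_irq_entry(struct toy_cpu *c)
{
    if (!c->ucode_loaded)
        c->wedged = true;   /* real hardware would raise #GP here */
}

/* Assumption: an interrupt is delivered as soon as IRQs are enabled. */
static void toy_enable_irqs(struct toy_cpu *c)
{
    c->irqs_enabled = true;
    toy_irq_entry(c);
}

static void toy_load_ucode(struct toy_cpu *c)
{
    c->ucode_loaded = true;
}

/* Returns true if the modelled resume path survives. */
bool toy_resume(bool ucode_before_irqs)
{
    struct toy_cpu c = { false, false, false };

    if (ucode_before_irqs) {
        toy_load_ucode(&c);
        toy_enable_irqs(&c);
    } else {
        toy_enable_irqs(&c);
        toy_load_ucode(&c);
    }
    return !c.wedged;
}
```

In this model toy_resume(true) survives and toy_resume(false) does not,
which is the behaviour reported at the top of the thread.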
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 11:51 ` Andrew Cooper
@ 2018-04-11 12:01   ` Simon Gaiser
  2018-04-11 12:11     ` Jan Beulich
  2018-04-11 12:12     ` Resume from suspend to RAM broken when using early microcode updates Andrew Cooper
  2018-04-11 12:04   ` Jan Beulich
  1 sibling, 2 replies; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 12:01 UTC
  To: Andrew Cooper, xen-devel; +Cc: Jan Beulich

Andrew Cooper:
> On 11/04/18 12:48, Simon Gaiser wrote:
>> Hi,
>>
>> when I use early microcode loading with the microcode update with the
>> BTI mitigations, resuming from suspend to RAM is broken.
>>
>> Based on added logging to enter_state() (from power.c) it doesn't
>> survive the local_irq_restore(flags) call (at least a printk() after the
>> call doesn't output anything on the serial console).
>>
>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>> is only loaded later.
>>
>> If I simply move the microcode_resume_cpu(0) directly before the
>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>> if this has unintended consequences.
>>
>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>> microcode patches from staging-4.8. AFAICS there are no commits which
>> changes the affected code or other commits which sound relevant so this
>> probably affected also all the newer branches.
>
> S3 support is a very unloved area of the hypervisor.
>
> Yes - we definitely need to get microcode reloaded before interrupts are
> enabled.

Do you see any problems with simply moving microcode_resume_cpu(0)
directly before the local_irq_restore(flags) call? (I'm not familiar
with the code at all and (early) resume handling sounds like something
which is easy to break in non-obvious ways)

> That said, I would have expected a backtrace complaining about
> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
> reloaded.

Yeah, not sure what's happening here. I don't get any output from after
local_irq_restore(flags). If you have some ideas for more debug output I
can easily test it.

Simon

* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:01   ` Simon Gaiser
@ 2018-04-11 12:11     ` Jan Beulich
  2018-04-11 12:17       ` Jan Beulich
  2018-04-11 12:12     ` Resume from suspend to RAM broken when using early microcode updates Andrew Cooper
  1 sibling, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2018-04-11 12:11 UTC
  To: Simon Gaiser; +Cc: Andrew Cooper, xen-devel

>>> On 11.04.18 at 14:01, <simon@invisiblethingslab.com> wrote:
> Andrew Cooper:
>> On 11/04/18 12:48, Simon Gaiser wrote:
>>> Hi,
>>>
>>> when I use early microcode loading with the microcode update with the
>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>
>>> Based on added logging to enter_state() (from power.c) it doesn't
>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>> call doesn't output anything on the serial console).
>>>
>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>> is only loaded later.
>>>
>>> If I simply move the microcode_resume_cpu(0) directly before the
>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>> if this has unintended consequences.
>>>
>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>> changes the affected code or other commits which sound relevant so this
>>> probably affected also all the newer branches.
>>
>> S3 support is a very unloved area of the hypervisor.
>>
>> Yes - we definitely need to get microcode reloaded before interrupts are
>> enabled.
>
> Do you see any problems with simply moving microcode_resume_cpu(0)
> directly before the local_irq_restore(flags) call? (I'm not familiar
> with the code at all and (early) resume handling sounds like something
> which is easy to break in non obvious ways)

Yes, there would be a problem: microcode_resume_cpu()
spin_lock()-s almost first thing, and this would break our
(simplistic) lock checking. Putting it also ahead of
spin_debug_enable() should work otoh.

Once at it, cpufreq_add_cpu() should be moved ahead of the
enable_cpu label as well, as cpufreq_del_cpu() wasn't called
yet at the point of the only goto to that label.

Jan

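The lock-checking concern raised above can be sketched with a simplified
model. This is an illustration under stated assumptions, not the code in
common/spinlock.c: all names (toy_lock, toy_lock_acquire(), and the
toy_spin_debug_*() helpers) are hypothetical. The model assumes the
checker remembers whether a lock was first taken with interrupts enabled
or disabled and objects when a later acquisition disagrees, and that the
check is skipped while lock debugging is switched off.

```c
#include <stdbool.h>

/* Toy model only; none of these names exist in Xen. */
struct toy_lock {
    int irq_safe;   /* -1 unknown, 0 taken with IRQs on, 1 with IRQs off */
};

static bool toy_checking_enabled = true;
static bool toy_irqs_enabled = true;

void toy_spin_debug_disable(void) { toy_checking_enabled = false; }
void toy_spin_debug_enable(void)  { toy_checking_enabled = true; }
void toy_set_irqs(bool on)        { toy_irqs_enabled = on; }

/*
 * Returns false where a real checker would crash the system: the lock
 * is being taken in an IRQ state that differs from the first one seen.
 */
bool toy_lock_acquire(struct toy_lock *l)
{
    int irq_safe = !toy_irqs_enabled;

    if (!toy_checking_enabled)
        return true;            /* checking off: anything goes */

    if (l->irq_safe == -1)
        l->irq_safe = irq_safe; /* first acquisition sets the rule */

    return l->irq_safe == irq_safe;
}
```

In this model, a lock that is normally taken with interrupts enabled
fails the check when taken with interrupts disabled, unless checking is
still off. That is the reason given above for placing the microcode
reload ahead of spin_debug_enable() rather than merely ahead of
local_irq_restore().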
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:11     ` Jan Beulich
@ 2018-04-11 12:17       ` Jan Beulich
  2018-04-11 12:46         ` Simon Gaiser
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2018-04-11 12:17 UTC
  To: Simon Gaiser; +Cc: Andrew Cooper, xen-devel

>>> On 11.04.18 at 14:11, <JBeulich@suse.com> wrote:
>>>> On 11.04.18 at 14:01, <simon@invisiblethingslab.com> wrote:
>> Andrew Cooper:
>>> On 11/04/18 12:48, Simon Gaiser wrote:
>>>> Hi,
>>>>
>>>> when I use early microcode loading with the microcode update with the
>>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>>
>>>> Based on added logging to enter_state() (from power.c) it doesn't
>>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>>> call doesn't output anything on the serial console).
>>>>
>>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>>> is only loaded later.
>>>>
>>>> If I simply move the microcode_resume_cpu(0) directly before the
>>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>>> if this has unintended consequences.
>>>>
>>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>>> changes the affected code or other commits which sound relevant so this
>>>> probably affected also all the newer branches.
>>>
>>> S3 support is a very unloved area of the hypervisor.
>>>
>>> Yes - we definitely need to get microcode reloaded before interrupts are
>>> enabled.
>>
>> Do you see any problems with simply moving microcode_resume_cpu(0)
>> directly before the local_irq_restore(flags) call? (I'm not familiar
>> with the code at all and (early) resume handling sounds like something
>> which is easy to break in non obvious ways)
>
> Yes, there would be a problem: microcode_resume_cpu()
> spin_lock()-s almost first thing, and this would break our
> (simplistic) lock checking. Putting it also ahead of
> spin_debug_enable() should work otoh.
>
> Once at it, cpufreq_add_cpu() should be moved ahead of the
> enable_cpu label as well, as cpufreq_del_cpu() wasn't called
> yet at the point of the only goto to that label.

And I think console_end_sync() wants to be moved earlier then
as well.

Jan

* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:17       ` Jan Beulich
@ 2018-04-11 12:46         ` Simon Gaiser
  2018-04-11 13:45           ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 12:46 UTC
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

Jan Beulich:
>>>> On 11.04.18 at 14:11, <JBeulich@suse.com> wrote:
>>>>> On 11.04.18 at 14:01, <simon@invisiblethingslab.com> wrote:
>>> Andrew Cooper:
>>>> On 11/04/18 12:48, Simon Gaiser wrote:
>>>>> Hi,
>>>>>
>>>>> when I use early microcode loading with the microcode update with the
>>>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>>>
>>>>> Based on added logging to enter_state() (from power.c) it doesn't
>>>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>>>> call doesn't output anything on the serial console).
>>>>>
>>>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>>>> is only loaded later.
>>>>>
>>>>> If I simply move the microcode_resume_cpu(0) directly before the
>>>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>>>> if this has unintended consequences.
>>>>>
>>>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>>>> changes the affected code or other commits which sound relevant so this
>>>>> probably affected also all the newer branches.
>>>>
>>>> S3 support is a very unloved area of the hypervisor.
>>>>
>>>> Yes - we definitely need to get microcode reloaded before interrupts are
>>>> enabled.
>>>
>>> Do you see any problems with simply moving microcode_resume_cpu(0)
>>> directly before the local_irq_restore(flags) call? (I'm not familiar
>>> with the code at all and (early) resume handling sounds like something
>>> which is easy to break in non obvious ways)
>>
>> Yes, there would be a problem: microcode_resume_cpu()
>> spin_lock()-s almost first thing, and this would break our
>> (simplistic) lock checking. Putting it also ahead of
>> spin_debug_enable() should work otoh.
>>
>> Once at it, cpufreq_add_cpu() should be moved ahead of the
>> enable_cpu label as well, as cpufreq_del_cpu() wasn't called
>> yet at the point of the only goto to that label.
>
> And I think console_end_sync() want to be moved earlier then
> as well.

Where exactly? console_end_sync() seems to match the position of
console_start_sync().

* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:46         ` Simon Gaiser
@ 2018-04-11 13:45           ` Jan Beulich
  2018-04-11 20:14             ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Simon Gaiser
  0 siblings, 1 reply; 17+ messages in thread
From: Jan Beulich @ 2018-04-11 13:45 UTC
  To: Simon Gaiser; +Cc: Andrew Cooper, xen-devel

>>> On 11.04.18 at 14:46, <simon@invisiblethingslab.com> wrote:
> Jan Beulich:
>>>>> On 11.04.18 at 14:11, <JBeulich@suse.com> wrote:
>>>>>> On 11.04.18 at 14:01, <simon@invisiblethingslab.com> wrote:
>>>> Andrew Cooper:
>>>>> On 11/04/18 12:48, Simon Gaiser wrote:
>>>>>> Hi,
>>>>>>
>>>>>> when I use early microcode loading with the microcode update with the
>>>>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>>>>
>>>>>> Based on added logging to enter_state() (from power.c) it doesn't
>>>>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>>>>> call doesn't output anything on the serial console).
>>>>>>
>>>>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>>>>> is only loaded later.
>>>>>>
>>>>>> If I simply move the microcode_resume_cpu(0) directly before the
>>>>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>>>>> if this has unintended consequences.
>>>>>>
>>>>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>>>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>>>>> changes the affected code or other commits which sound relevant so this
>>>>>> probably affected also all the newer branches.
>>>>>
>>>>> S3 support is a very unloved area of the hypervisor.
>>>>>
>>>>> Yes - we definitely need to get microcode reloaded before interrupts are
>>>>> enabled.
>>>>
>>>> Do you see any problems with simply moving microcode_resume_cpu(0)
>>>> directly before the local_irq_restore(flags) call? (I'm not familiar
>>>> with the code at all and (early) resume handling sounds like something
>>>> which is easy to break in non obvious ways)
>>>
>>> Yes, there would be a problem: microcode_resume_cpu()
>>> spin_lock()-s almost first thing, and this would break our
>>> (simplistic) lock checking. Putting it also ahead of
>>> spin_debug_enable() should work otoh.
>>>
>>> Once at it, cpufreq_add_cpu() should be moved ahead of the
>>> enable_cpu label as well, as cpufreq_del_cpu() wasn't called
>>> yet at the point of the only goto to that label.
>>
>> And I think console_end_sync() want to be moved earlier then
>> as well.
>
> Where exactly? console_end_sync() seems to match the position of
> console_start_sync().

The question isn't symmetry with the start_sync, but the fact that
at the right log level (and on big systems) microcode updates can
be quite verbose. We don't want all this output to go out in sync
mode, I think, albeit then again doing the output in normal mode
may mean some of it gets discarded (but personally I think that's
acceptable).

As to where exactly, the easiest seems to be to hand you a patch.
Please give this a try.

Of course none of this addresses a possible NMI or #MC occurring
before the microcode loading.

Jan

x86: correct ordering of operations during S3 resume

Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the first
place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As microcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().

cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- unstable.orig/xen/arch/x86/acpi/power.c
+++ unstable/xen/arch/x86/acpi/power.c
@@ -203,6 +203,7 @@ static int enter_state(u32 state)
         printk(XENLOG_ERR "Some devices failed to power down.");
         system_state = SYS_STATE_resume;
         device_power_up(error);
+        console_end_sync();
         error = -EIO;
         goto done;
     }
@@ -243,17 +244,19 @@ static int enter_state(u32 state)
     if ( (state == ACPI_STATE_S3) && error )
         tboot_s3_error(error);
 
+    console_end_sync();
+
+    microcode_resume_cpu(0);
+
  done:
     spin_debug_enable();
     local_irq_restore(flags);
-    console_end_sync();
     acpi_sleep_post(state);
     if ( hvm_cpu_up() )
         BUG();
+    cpufreq_add_cpu(0);
 
  enable_cpu:
-    cpufreq_add_cpu(0);
-    microcode_resume_cpu(0);
     rcu_barrier();
     mtrr_aps_sync_begin();
     enable_nonboot_cpus();

* [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu()
  2018-04-11 13:45           ` Jan Beulich
@ 2018-04-11 20:14             ` Simon Gaiser
  2018-04-11 20:14               ` [PATCH 2/2] x86: correct ordering of operations during S3 resume Simon Gaiser
  2018-04-12  6:56               ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Jan Beulich
  0 siblings, 2 replies; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 20:14 UTC
  To: xen-devel; +Cc: Simon Gaiser, Andrew Cooper, Jan Beulich

Make it possible to distinguish between a failure to load a microcode
update and a failure to find any matching microcode update by returning
-ENOENT (instead of -EIO) in the latter case.

Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
---
 xen/arch/x86/microcode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
index 77c1efc97f..e9fafb1d14 100644
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -248,7 +248,7 @@ int microcode_resume_cpu(unsigned int cpu)
     __microcode_fini_cpu(cpu);
     uci->cpu_sig = nsig;
 
-    err = -EIO;
+    err = -ENOENT;
     for_each_online_cpu ( cpu2 )
     {
         uci = &per_cpu(ucode_cpu_info, cpu2);
-- 
2.16.2

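As an illustration of why the distinct return code is useful, here is a
minimal sketch of how a caller might classify the result. The enum and
the helper toy_classify_ucode_rc() are hypothetical and not part of the
patch; the sketch only assumes the convention described in the commit
message above, i.e. 0 for success, -ENOENT for "no matching update
found", and any other negative value for a real load failure.

```c
#include <errno.h>

/* Hypothetical names, for illustration only. */
enum toy_resume_action {
    TOY_CONTINUE,   /* safe to carry on resuming */
    TOY_FATAL       /* a microcode update existed but could not be loaded */
};

/*
 * Classify the return code of a microcode reload attempt.
 * "Nothing to load" (-ENOENT) is treated like success, since the CPU is
 * then in the same microcode state it was in before suspend.
 */
enum toy_resume_action toy_classify_ucode_rc(int rc)
{
    if (rc == 0 || rc == -ENOENT)
        return TOY_CONTINUE;

    return TOY_FATAL;
}
```

With the previous behaviour both failure modes returned -EIO, so a
caller could not tell "nothing to do" apart from "the update failed".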
* [PATCH 2/2] x86: correct ordering of operations during S3 resume
  2018-04-11 20:14             ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Simon Gaiser
@ 2018-04-11 20:14               ` Simon Gaiser
  2018-04-11 20:21                 ` [PATCH v2 " Simon Gaiser
  2018-04-12  6:56               ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Jan Beulich
  1 sibling, 1 reply; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 20:14 UTC
  To: xen-devel; +Cc: Simon Gaiser, Andrew Cooper, Jan Beulich

Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the first
place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As microcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().

cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
[Simon: Panic if microcode restore fails]
Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
---
 xen/arch/x86/acpi/power.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index 1e4e5680a7..0e00c198db 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -203,6 +203,7 @@ static int enter_state(u32 state)
         printk(XENLOG_ERR "Some devices failed to power down.");
         system_state = SYS_STATE_resume;
         device_power_up(error);
+        console_end_sync();
         error = -EIO;
         goto done;
     }
@@ -243,17 +244,21 @@ static int enter_state(u32 state)
     if ( (state == ACPI_STATE_S3) && error )
         tboot_s3_error(error);
 
+    console_end_sync();
+
+    error = microcode_resume_cpu(0);
+    if (error && error != -ENOENT)
+        panic("Could not restore microcode on boot cpu (%d)", error);
+
  done:
     spin_debug_enable();
     local_irq_restore(flags);
-    console_end_sync();
     acpi_sleep_post(state);
     if ( hvm_cpu_up() )
         BUG();
+    cpufreq_add_cpu(0);
 
  enable_cpu:
-    cpufreq_add_cpu(0);
-    microcode_resume_cpu(0);
     rcu_barrier();
     mtrr_aps_sync_begin();
     enable_nonboot_cpus();
-- 
2.16.2

* [PATCH v2 2/2] x86: correct ordering of operations during S3 resume
  2018-04-11 20:14               ` [PATCH 2/2] x86: correct ordering of operations during S3 resume Simon Gaiser
@ 2018-04-11 20:21                 ` Simon Gaiser
       [not found]                   ` <5ACE6EA10200005203786DDD@prv1-mh.provo.novell.com>
  0 siblings, 1 reply; 17+ messages in thread
From: Simon Gaiser @ 2018-04-11 20:21 UTC
  To: xen-devel; +Cc: Simon Gaiser, Andrew Cooper, Jan Beulich

From: Jan Beulich <JBeulich@suse.com>

Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the first
place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As microcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().

cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
[Simon: Panic if microcode restore fails]
Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
---
Sorry didn't want to rewrite the author. No other change in v2.

 xen/arch/x86/acpi/power.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c
index 1e4e5680a7..0e00c198db 100644
--- a/xen/arch/x86/acpi/power.c
+++ b/xen/arch/x86/acpi/power.c
@@ -203,6 +203,7 @@ static int enter_state(u32 state)
         printk(XENLOG_ERR "Some devices failed to power down.");
         system_state = SYS_STATE_resume;
         device_power_up(error);
+        console_end_sync();
         error = -EIO;
         goto done;
     }
@@ -243,17 +244,21 @@ static int enter_state(u32 state)
     if ( (state == ACPI_STATE_S3) && error )
         tboot_s3_error(error);
 
+    console_end_sync();
+
+    error = microcode_resume_cpu(0);
+    if (error && error != -ENOENT)
+        panic("Could not restore microcode on boot cpu (%d)", error);
+
  done:
     spin_debug_enable();
     local_irq_restore(flags);
-    console_end_sync();
     acpi_sleep_post(state);
     if ( hvm_cpu_up() )
         BUG();
+    cpufreq_add_cpu(0);
 
  enable_cpu:
-    cpufreq_add_cpu(0);
-    microcode_resume_cpu(0);
     rcu_barrier();
     mtrr_aps_sync_begin();
     enable_nonboot_cpus();
-- 
2.16.2

[parent not found: <5ACE6EA10200005203786DDD@prv1-mh.provo.novell.com>]
* Re: [PATCH v2 2/2] x86: correct ordering of operations during S3 resume
       [not found]                   ` <5ACE6EA10200005203786DDD@prv1-mh.provo.novell.com>
@ 2018-04-12  7:12                     ` Jan Beulich
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2018-04-12 7:12 UTC
  To: Andrew Cooper, Simon Gaiser; +Cc: xen-devel

>>> On 11.04.18 at 22:21, <simon@invisiblethingslab.com> wrote:
> @@ -243,17 +244,21 @@ static int enter_state(u32 state)
>      if ( (state == ACPI_STATE_S3) && error )
>          tboot_s3_error(error);
>
> +    console_end_sync();
> +
> +    error = microcode_resume_cpu(0);
> +    if (error && error != -ENOENT)

Missing blanks.

> +        panic("Could not restore microcode on boot cpu (%d)", error);

Andrew, you've suggested the panic() here, but I'm not convinced
this is strictly necessary. For most ucode updates we're fine
without, and could accept them being re-applied post-resume.
That'll make a lot of false positive panics. Furthermore, in case
there are really mixed steppings, receiving -ENOENT here still
means we're going to die soon after. I.e. (rare) false negatives
are possible as well.

Instead what I think we want is a feature comparison after
resume: Any feature we (or any alive guests) have in active use
needs to be available. That is, host_cpuid_policy must not have
changed (the two {hvm,pv}_max_cpuid_policy are only derived,
and hence won't need separate checking afaict; without that
checking host_cpuid_policy might be too strict, but then again
compiling a list of features we actually use would be prone to go
stale very quickly once use of new features is introduced, unless
we tied this into cpu_has_* checks, implying that any such check
means the feature is used if available, yet that would in turn
have issues in particular with the uses in the insn emulator).

Jan

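The feature comparison suggested above could look roughly like the
following. This is a sketch under assumptions, not a proposal for the
actual code: the function name toy_features_preserved() is hypothetical
and the feature set is modelled as a plain array of 32-bit words rather
than Xen's cpuid policy structures. It also implements the looser of the
two variants discussed, namely "no feature that was present before
suspend has disappeared", rather than strict equality of the whole
policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * 'before' is a snapshot of the feature words taken prior to suspend,
 * 'after' is the same set re-read following resume. The check passes if
 * every bit set in 'before' is still set in 'after'; newly appearing
 * bits are tolerated.
 */
bool toy_features_preserved(const uint32_t *before,
                            const uint32_t *after,
                            size_t nr_words)
{
    for (size_t i = 0; i < nr_words; i++)
        if (before[i] & ~after[i])
            return false;   /* at least one feature was lost */

    return true;
}
```

A strict "must not have changed" check would instead compare the two
arrays for equality, for instance with memcmp().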
* Re: [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu()
  2018-04-11 20:14             ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Simon Gaiser
  2018-04-11 20:14               ` [PATCH 2/2] x86: correct ordering of operations during S3 resume Simon Gaiser
@ 2018-04-12  6:56               ` Jan Beulich
  1 sibling, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2018-04-12 6:56 UTC
  To: Simon Gaiser; +Cc: Andrew Cooper, xen-devel

>>> On 11.04.18 at 22:14, <simon@invisiblethingslab.com> wrote:
> Make it possible to distinguish between a failure to load a microcode
> update and a failure to find any matching microcode update by returning
> -ENOENT (instead of -EIO) in the later case.
>
> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
> ---
>  xen/arch/x86/microcode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
> index 77c1efc97f..e9fafb1d14 100644
> --- a/xen/arch/x86/microcode.c
> +++ b/xen/arch/x86/microcode.c
> @@ -248,7 +248,7 @@ int microcode_resume_cpu(unsigned int cpu)
>      __microcode_fini_cpu(cpu);
>      uci->cpu_sig = nsig;
>
> -    err = -EIO;
> +    err = -ENOENT;
>      for_each_online_cpu ( cpu2 )
>      {
>          uci = &per_cpu(ucode_cpu_info, cpu2);

I think this should be part of the other patch, but I'll reply there
in a minute.

Jan

* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:01   ` Simon Gaiser
  2018-04-11 12:11     ` Jan Beulich
@ 2018-04-11 12:12     ` Andrew Cooper
  2018-04-11 14:49       ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2018-04-11 12:12 UTC
  To: Simon Gaiser, xen-devel; +Cc: Jan Beulich

On 11/04/18 13:01, Simon Gaiser wrote:
> Andrew Cooper:
>> On 11/04/18 12:48, Simon Gaiser wrote:
>>> Hi,
>>>
>>> when I use early microcode loading with the microcode update with the
>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>
>>> Based on added logging to enter_state() (from power.c) it doesn't
>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>> call doesn't output anything on the serial console).
>>>
>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>> is only loaded later.
>>>
>>> If I simply move the microcode_resume_cpu(0) directly before the
>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>> if this has unintended consequences.
>>>
>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>> changes the affected code or other commits which sound relevant so this
>>> probably affected also all the newer branches.
>> S3 support is a very unloved area of the hypervisor.
>>
>> Yes - we definitely need to get microcode reloaded before interrupts are
>> enabled.
> Do you see any problems with simply moving microcode_resume_cpu(0)
> directly before the local_irq_restore(flags) call? (I'm not familiar
> with the code at all and (early) resume handling sounds like something
> which is easy to break in non obvious ways)

Judging by what is going on, it wants to be between tboot_s3_error() and
the done label.

We only need to restore microcode if we successfully went into S3.  The
done and enable_cpu labels are only used by paths which don't need to
restore microcode.

OTOH, you should check the return value and panic if restoration
failed.  As you've seen, the system won't survive trying to blindly
continue resuming.

>
>> That said, I would have expected a backtrace complaining about
>> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
>> reloaded.
> Yeah, not sure what's happening here. I don't get any output from after
> local_irq_restore(flags). If you have some ideas for more debug output I
> can easily test it.

In hindsight, I am.  We take a #GP fault because of a bad MSR, and at
the head of the exception handler try to use the same bad MSR.  It will
repeatedly fault until hitting a guard page (or other read-only page),
at which point we take a double fault, and suffer a #GP yet again. 
Taking a #DF will reset the stack to a moderately sane value, and the
system will livelock taking faults.

This is an unfortunate consequence of having $MAGIC in the exception
handlers.

~Andrew

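The fault loop described above can be mimicked with a very small
simulation. It is only a toy under stated assumptions: the structure and
function names are hypothetical, each #GP is modelled as consuming one
stack frame, the guard page is modelled as a fixed frame count, and a
double fault is modelled as resetting the depth to zero. No real
exception delivery is involved.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
struct toy_trace {
    unsigned int gp_faults;      /* #GP taken on the missing MSR */
    unsigned int double_faults;  /* #DF taken on reaching the guard page */
    bool made_progress;          /* handler ever got past its first MSR access */
};

/*
 * Run the modelled exception entry path for at most 'max_steps' attempts.
 * 'stack_frames' is the number of nested faults that fit on the stack
 * before the guard page is hit.
 */
struct toy_trace toy_run_faults(bool msr_available,
                                unsigned int stack_frames,
                                unsigned int max_steps)
{
    struct toy_trace t = { 0, 0, false };
    unsigned int depth = 0;

    for (unsigned int step = 0; step < max_steps; step++) {
        if (msr_available) {
            t.made_progress = true;  /* MSR access works, handler proceeds */
            break;
        }

        /* Head of the handler touches the missing MSR: another #GP. */
        t.gp_faults++;

        if (++depth == stack_frames) {
            /* Guard page reached: #DF, which resets the stack. */
            t.double_faults++;
            depth = 0;
        }
    }
    return t;
}
```

With the MSR unavailable the loop never sets made_progress, no matter
how many steps it is given, which matches the "livelock taking faults"
description and the absence of any backtrace on the console.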
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 12:12 ` Resume from suspend to RAM broken when using early microcode updates Andrew Cooper
@ 2018-04-11 14:49 ` Konrad Rzeszutek Wilk
  2018-04-11 15:05 ` Andrew Cooper
  0 siblings, 1 reply; 17+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-04-11 14:49 UTC
To: Andrew Cooper; +Cc: Simon Gaiser, xen-devel, Jan Beulich

On Wed, Apr 11, 2018 at 01:12:51PM +0100, Andrew Cooper wrote:
> On 11/04/18 13:01, Simon Gaiser wrote:
> > Andrew Cooper:
> >> On 11/04/18 12:48, Simon Gaiser wrote:
> >>> Hi,
> >>>
> >>> when I use early microcode loading with the microcode update with the
> >>> BTI mitigations, resuming from suspend to RAM is broken.
> >>>
> >>> Based on added logging to enter_state() (from power.c) it doesn't
> >>> survive the local_irq_restore(flags) call (at least a printk() after the
> >>> call doesn't output anything on the serial console).
> >>>
> >>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
> >>> is only loaded later.
> >>>
> >>> If I simply move the microcode_resume_cpu(0) directly before the
> >>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
> >>> if this has unintended consequences.
> >>>
> >>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
> >>> microcode patches from staging-4.8. AFAICS there are no commits which
> >>> changes the affected code or other commits which sound relevant so this
> >>> probably affected also all the newer branches.
> >> S3 support is a very unloved area of the hypervisor.
> >>
> >> Yes - we definitely need to get microcode reloaded before interrupts are
> >> enabled.
> > Do you see any problems with simply moving microcode_resume_cpu(0)
> > directly before the local_irq_restore(flags) call? (I'm not familiar
> > with the code at all and (early) resume handling sounds like something
> > which is easy to break in non obvious ways)
>
> Judging by what is going on, it wants to be between tboot_s3_error() and
> the done label.
>
> We only need to restore microcode if we successfully went into S3. The
> done and enable_cpu labels are only used by paths which don't need to
> restore microcode.
>
> OTOH, you should check the return value and panic if restoration
> failed. As you've seen, the system won't survive trying to blindly
> continue resuming.
>
> >
> >> That said, I would have expected a backtrace complaining about
> >> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
> >> reloaded.
> > Yeah, not sure what's happening here. I don't get any output from after
> > local_irq_restore(flags). If you have some ideas for more debug output I
> > can easily test it.
>
> In hindsight, I am. We take a #GP fault because of a bad MSR, and at
> the head of the exception handler try to use the same bad MSR. It will
> repeatedly fault until hitting a guard page (or other read-only page),
> at which point we take a double fault, and suffer a #GP yet again.
> Taking a #DF will reset the stack to a moderately sane value, and the
> system will livelock taking faults.
>
> This is an unfortunate consequence of having $MAGIC in the exception
> handlers.

We can just disable IBRS before going to sleep?

>
> ~Andrew
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 14:49 ` Konrad Rzeszutek Wilk
@ 2018-04-11 15:05 ` Andrew Cooper
  2018-04-11 15:32 ` Jan Beulich
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Cooper @ 2018-04-11 15:05 UTC
To: Konrad Rzeszutek Wilk; +Cc: Simon Gaiser, xen-devel, Jan Beulich

On 11/04/18 15:49, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 11, 2018 at 01:12:51PM +0100, Andrew Cooper wrote:
>> On 11/04/18 13:01, Simon Gaiser wrote:
>>> Andrew Cooper:
>>>> On 11/04/18 12:48, Simon Gaiser wrote:
>>>>> Hi,
>>>>>
>>>>> when I use early microcode loading with the microcode update with the
>>>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>>>
>>>>> Based on added logging to enter_state() (from power.c) it doesn't
>>>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>>>> call doesn't output anything on the serial console).
>>>>>
>>>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>>>> is only loaded later.
>>>>>
>>>>> If I simply move the microcode_resume_cpu(0) directly before the
>>>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>>>> if this has unintended consequences.
>>>>>
>>>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>>>> changes the affected code or other commits which sound relevant so this
>>>>> probably affected also all the newer branches.
>>>> S3 support is a very unloved area of the hypervisor.
>>>>
>>>> Yes - we definitely need to get microcode reloaded before interrupts are
>>>> enabled.
>>> Do you see any problems with simply moving microcode_resume_cpu(0)
>>> directly before the local_irq_restore(flags) call? (I'm not familiar
>>> with the code at all and (early) resume handling sounds like something
>>> which is easy to break in non obvious ways)
>> Judging by what is going on, it wants to be between tboot_s3_error() and
>> the done label.
>>
>> We only need to restore microcode if we successfully went into S3. The
>> done and enable_cpu labels are only used by paths which don't need to
>> restore microcode.
>>
>> OTOH, you should check the return value and panic if restoration
>> failed. As you've seen, the system won't survive trying to blindly
>> continue resuming.
>>
>>>> That said, I would have expected a backtrace complaining about
>>>> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
>>>> reloaded.
>>> Yeah, not sure what's happening here. I don't get any output from after
>>> local_irq_restore(flags). If you have some ideas for more debug output I
>>> can easily test it.
>> In hindsight, I am. We take a #GP fault because of a bad MSR, and at
>> the head of the exception handler try to use the same bad MSR. It will
>> repeatedly fault until hitting a guard page (or other read-only page),
>> at which point we take a double fault, and suffer a #GP yet again.
>> Taking a #DF will reset the stack to a moderately sane value, and the
>> system will livelock taking faults.
>>
>> This is an unfortunate consequence of having $MAGIC in the exception
>> handlers.
> We can just disable IBRS before going to sleep?

The problem is the use of MSR_SPEC_CTRL/MSR_PRED_CMD before we've
reloaded the microcode which causes those MSRs to magic themselves into
existence.

~Andrew
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 15:05 ` Andrew Cooper
@ 2018-04-11 15:32 ` Jan Beulich
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2018-04-11 15:32 UTC
To: Andrew Cooper, Konrad Rzeszutek Wilk; +Cc: Simon Gaiser, xen-devel

>>> On 11.04.18 at 17:05, <andrew.cooper3@citrix.com> wrote:
> On 11/04/18 15:49, Konrad Rzeszutek Wilk wrote:
>> On Wed, Apr 11, 2018 at 01:12:51PM +0100, Andrew Cooper wrote:
>>> On 11/04/18 13:01, Simon Gaiser wrote:
>>>> Andrew Cooper:
>>>>> On 11/04/18 12:48, Simon Gaiser wrote:
>>>>>> Hi,
>>>>>>
>>>>>> when I use early microcode loading with the microcode update with the
>>>>>> BTI mitigations, resuming from suspend to RAM is broken.
>>>>>>
>>>>>> Based on added logging to enter_state() (from power.c) it doesn't
>>>>>> survive the local_irq_restore(flags) call (at least a printk() after the
>>>>>> call doesn't output anything on the serial console).
>>>>>>
>>>>>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>>>>>> is only loaded later.
>>>>>>
>>>>>> If I simply move the microcode_resume_cpu(0) directly before the
>>>>>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>>>>>> if this has unintended consequences.
>>>>>>
>>>>>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>>>>>> microcode patches from staging-4.8. AFAICS there are no commits which
>>>>>> changes the affected code or other commits which sound relevant so this
>>>>>> probably affected also all the newer branches.
>>>>> S3 support is a very unloved area of the hypervisor.
>>>>>
>>>>> Yes - we definitely need to get microcode reloaded before interrupts are
>>>>> enabled.
>>>> Do you see any problems with simply moving microcode_resume_cpu(0)
>>>> directly before the local_irq_restore(flags) call? (I'm not familiar
>>>> with the code at all and (early) resume handling sounds like something
>>>> which is easy to break in non obvious ways)
>>> Judging by what is going on, it wants to be between tboot_s3_error() and
>>> the done label.
>>>
>>> We only need to restore microcode if we successfully went into S3. The
>>> done and enable_cpu labels are only used by paths which don't need to
>>> restore microcode.
>>>
>>> OTOH, you should check the return value and panic if restoration
>>> failed. As you've seen, the system won't survive trying to blindly
>>> continue resuming.
>>>
>>>>> That said, I would have expected a backtrace complaining about
>>>>> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
>>>>> reloaded.
>>>> Yeah, not sure what's happening here. I don't get any output from after
>>>> local_irq_restore(flags). If you have some ideas for more debug output I
>>>> can easily test it.
>>> In hindsight, I am. We take a #GP fault because of a bad MSR, and at
>>> the head of the exception handler try to use the same bad MSR. It will
>>> repeatedly fault until hitting a guard page (or other read-only page),
>>> at which point we take a double fault, and suffer a #GP yet again.
>>> Taking a #DF will reset the stack to a moderately sane value, and the
>>> system will livelock taking faults.
>>>
>>> This is an unfortunate consequence of having $MAGIC in the exception
>>> handlers.
>> We can just disable IBRS before going to sleep?
>
> The problem is the use of MSR_SPEC_CTRL/MSR_PRED_CMD before we've
> reloaded the microcode which causes those MSRs to magic themselves into
> existence.

Well, Konrad certainly has a point: Just like we could make
SPEC_CTRL_ENTRY_FROM_INTR_IST skip the WRMSR by clearing BTI_IST_WRMSR,
we could add a conditional branch to DO_SPEC_CTRL_ENTRY's maybexen case.
The former would also allow to deal with an early (after resume) NMI or
#MC.

Jan
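The idea in the message above, gating the MSR write on a flag that is cleared across suspend, can be sketched as a standalone C model. Everything here is hypothetical and only illustrates the control flow: `entry_wrmsr_enabled` plays the role that the message assigns to BTI_IST_WRMSR, and `sim_exception_entry()`, `sim_suspend()` and `sim_reload_microcode()` are invented names, not the real entry macros.

```c
#include <stdbool.h>

/* Hypothetical model state; not taken from the Xen sources. */
static bool msr_present = true;          /* MSRs exist only with the update */
static bool entry_wrmsr_enabled = true;  /* analogue of a "do the WRMSR" flag */
static unsigned int fault_count;         /* faults caused by a missing MSR */

/* Models the head of an exception/interrupt entry path. */
static void sim_exception_entry(void)
{
    if (!entry_wrmsr_enabled)
        return;             /* conditional branch: skip the WRMSR entirely */
    if (!msr_present)
        fault_count++;      /* the WRMSR would raise #GP here */
}

/* Going into S3 loses the update; optionally clear the flag first. */
static void sim_suspend(bool clear_flag)
{
    if (clear_flag)
        entry_wrmsr_enabled = false;
    msr_present = false;
}

/* After the reload the MSRs exist again, so the flag can be set again. */
static void sim_reload_microcode(void)
{
    msr_present = true;
    entry_wrmsr_enabled = true;
}
```

In this model an entry taken between `sim_suspend(true)` and `sim_reload_microcode()` does not fault, which is the property the message is after for an early NMI or #MC.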
* Re: Resume from suspend to RAM broken when using early microcode updates
  2018-04-11 11:51 ` Andrew Cooper
  2018-04-11 12:01 ` Simon Gaiser
@ 2018-04-11 12:04 ` Jan Beulich
  1 sibling, 0 replies; 17+ messages in thread
From: Jan Beulich @ 2018-04-11 12:04 UTC
To: Andrew Cooper; +Cc: Simon Gaiser, xen-devel

>>> On 11.04.18 at 13:51, <andrew.cooper3@citrix.com> wrote:
> On 11/04/18 12:48, Simon Gaiser wrote:
>> Hi,
>>
>> when I use early microcode loading with the microcode update with the
>> BTI mitigations, resuming from suspend to RAM is broken.
>>
>> Based on added logging to enter_state() (from power.c) it doesn't
>> survive the local_irq_restore(flags) call (at least a printk() after the
>> call doesn't output anything on the serial console).
>>
>> I guess that some irq handler tries to use IBRS/IBPB. But the microcode
>> is only loaded later.
>>
>> If I simply move the microcode_resume_cpu(0) directly before the
>> local_irq_restore(flags) everything seems to work fine. But I'm not sure
>> if this has unintended consequences.
>>
>> I tested the above with Xen 4.8.3 from Qubes which includes the BTI and
>> microcode patches from staging-4.8. AFAICS there are no commits which
>> changes the affected code or other commits which sound relevant so this
>> probably affected also all the newer branches.
>
> S3 support is a very unloved area of the hypervisor.
>
> Yes - we definitely need to get microcode reloaded before interrupts are
> enabled. That said, I would have expected a backtrace complaining about
> a GP fault if we had hit the use of IBRS/IBPB before the microcode was
> reloaded.

Wouldn't the #GP handling re-raise a #GP for the same reason, until
hitting the end of the stack, making it a #DF, the handler of which
would yet again have the same issue? This would end with an infinite
loop between #GP and #DF handlers (never triple faulting), and no
output.

Jan
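The #GP/#DF loop described in the message above can be illustrated with a bounded C simulation. It is a sketch under loud assumptions: `STACK_FRAMES`, `struct fault_stats` and `sim_fault_loop()` are hypothetical, the stack is reduced to a frame counter, and "printed" only models whether any handler body (and thus any console output) is ever reached.

```c
#include <stdbool.h>

/* Hypothetical: nested exception frames that fit before the guard page. */
#define STACK_FRAMES 8u

struct fault_stats {
    unsigned int gp;   /* number of #GP faults taken */
    unsigned int df;   /* number of #DF faults taken */
    bool printed;      /* did any handler get far enough to print? */
};

/*
 * Every handler writes the MSR at its head. If the MSR is missing, that
 * write faults again before the handler body runs. Once the stack is
 * exhausted, the next fault becomes a #DF, which restarts on a fresh
 * stack and immediately faults in the same way.
 */
static struct fault_stats sim_fault_loop(bool msr_present,
                                         unsigned int max_steps)
{
    struct fault_stats st = { 0, 0, false };
    unsigned int depth = 0;

    for (unsigned int step = 0; step < max_steps; step++) {
        if (msr_present) {
            st.printed = true;   /* handler body runs, output appears */
            break;
        }
        if (depth == STACK_FRAMES) {
            st.df++;             /* no room for another frame: #DF */
            depth = 0;           /* #DF handler starts on a sane stack */
        } else {
            st.gp++;             /* WRMSR at the handler head: #GP */
            depth++;
        }
    }
    return st;
}
```

With the MSR missing, the model never reaches a handler body however many steps it is given, which matches the observation of no output on the serial console.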
end of thread, other threads:[~2018-04-12 7:12 UTC]

Thread overview: 17+ messages
2018-04-11 11:48 Resume from suspend to RAM broken when using early microcode updates Simon Gaiser
2018-04-11 11:51 ` Andrew Cooper
2018-04-11 12:01 ` Simon Gaiser
2018-04-11 12:11 ` Jan Beulich
2018-04-11 12:17 ` Jan Beulich
2018-04-11 12:46 ` Simon Gaiser
2018-04-11 13:45 ` Jan Beulich
2018-04-11 20:14 ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Simon Gaiser
2018-04-11 20:14 ` [PATCH 2/2] x86: correct ordering of operations during S3 resume Simon Gaiser
2018-04-11 20:21 ` [PATCH v2 " Simon Gaiser
[not found] ` <5ACE6EA10200005203786DDD@prv1-mh.provo.novell.com>
2018-04-12 7:12 ` Jan Beulich
2018-04-12 6:56 ` [PATCH 1/2] x86/microcode: Indicate "not found" in rc of microcode_resume_cpu() Jan Beulich
2018-04-11 12:12 ` Resume from suspend to RAM broken when using early microcode updates Andrew Cooper
2018-04-11 14:49 ` Konrad Rzeszutek Wilk
2018-04-11 15:05 ` Andrew Cooper
2018-04-11 15:32 ` Jan Beulich
2018-04-11 12:04 ` Jan Beulich