I observe this with guest 3.13 and host 3.13
when running with -cpu host on my laptop:

[ 0.043000] Call Trace:
[ 0.043000] [<ffffffff81d0e873>] init_hw_perf_events+0x33/0x5cd
[ 0.043000] [<ffffffff81d0e840>] ? check_bugs+0x40/0x40
[ 0.043000] [<ffffffff8100030a>] do_one_initcall+0x13a/0x190
[ 0.043000] [<ffffffff81d15133>] ? native_smp_prepare_cpus+0x285/0x3ee
[ 0.043000] [<ffffffff81d068da>] kernel_init_freeable+0x136/0x298
[ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
[ 0.043000] [<ffffffff816834ee>] kernel_init+0xe/0x130
[ 0.043000] [<ffffffff8169422c>] ret_from_fork+0x7c/0xb0
[ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
[ 0.043000] Code: 0f 46 c2 41 83 e8 01 89 05 63 4c fd ff 7e 2e 44 89
d2 b8 03 00 00 00 b9 45 03 00 00 83 e2 1f 83 fa 02 0f 4f c2 89 05 6d 4b
fd ff <0f> 32 48 c1 e2 20 89 c0 48 09 c2 48 89 15 0b 4c fd ff e8 c6 d3
[ 0.043000] RIP [<ffffffff81d0f8c3>] intel_pmu_init+0x208/0x95a
[ 0.043000] RSP <ffff88003f25fe18>
[ 0.043012] ---[ end trace 9f1576f03a80bfa0 ]---
[ 0.044018] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b

-cpu kvm64 works fine.

Reproduces with upstream qemu a75143eda2ddf581b51e96c000974bcdfe2cbd10,
as well as with qemu-kvm from Fedora 19.

Tried recent git from Linus - it still has this problem.

--
MST
On Sun, Feb 02, 2014 at 10:59:30PM +0200, Michael S. Tsirkin wrote:
> I observe this with guest 3.13 and host 3.13
> when running with -cpu host on my laptop:
>
> [ 0.043000] Call Trace:
> [ 0.043000] [<ffffffff81d0e873>] init_hw_perf_events+0x33/0x5cd
> [ 0.043000] [<ffffffff81d0e840>] ? check_bugs+0x40/0x40
> [ 0.043000] [<ffffffff8100030a>] do_one_initcall+0x13a/0x190
> [ 0.043000] [<ffffffff81d15133>] ? native_smp_prepare_cpus+0x285/0x3ee
> [ 0.043000] [<ffffffff81d068da>] kernel_init_freeable+0x136/0x298
> [ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
> [ 0.043000] [<ffffffff816834ee>] kernel_init+0xe/0x130
> [ 0.043000] [<ffffffff8169422c>] ret_from_fork+0x7c/0xb0
> [ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
> [ 0.043000] Code: 0f 46 c2 41 83 e8 01 89 05 63 4c fd ff 7e 2e 44 89
> d2 b8 03 00 00 00 b9 45 03 00 00 83 e2 1f 83 fa 02 0f 4f c2 89 05 6d 4b
> fd ff <0f> 32 48 c1 e2 20 89 c0 48 09 c2 48 89 15 0b 4c fd ff e8 c6 d3

   0:   0f 46 c2                cmovbe %edx,%eax
   3:   41 83 e8 01             sub    $0x1,%r8d
   7:   89 05 63 4c fd ff       mov    %eax,-0x2b39d(%rip)   # 0xfffffffffffd4c70
   d:   7e 2e                   jle    0x3d
   f:   44 89 d2                mov    %r10d,%edx
  12:   b8 03 00 00 00          mov    $0x3,%eax
  17:   b9 45 03 00 00          mov    $0x345,%ecx
  1c:   83 e2 1f                and    $0x1f,%edx
  1f:   83 fa 02                cmp    $0x2,%edx
  22:   0f 4f c2                cmovg  %edx,%eax
  25:   89 05 6d 4b fd ff       mov    %eax,-0x2b493(%rip)   # 0xfffffffffffd4b98
  2b:*  0f 32                   rdmsr                        <-- trapping instruction
  2d:   48 c1 e2 20             shl    $0x20,%rdx
  31:   89 c0                   mov    %eax,%eax
  33:   48 09 c2                or     %rax,%rdx
  36:   48 89 15 0b 4c fd ff    mov    %rdx,-0x2b3f5(%rip)   # 0xfffffffffffd4c48
  3d:   e8                      .byte 0xe8
  3e:   c6                      (bad)
  3f:   d3                      .byte 0xd3

Linux seems to be trying to read IA32_PERF_CAPABILITIES without checking the
PDCM flag (CPUID[1].ECX[15]).

I can't see why this wasn't crashing before, though. That code seems to be old.

	 * v2 and above have a perf capabilities MSR
	 */
	if (version > 1) {
		u64 capabilities;

		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
		x86_pmu.intel_cap.capabilities = capabilities;
	}

Where does the "v2 and above have a perf capabilities MSR" claim in the code
come from?

> [ 0.043000] RIP [<ffffffff81d0f8c3>] intel_pmu_init+0x208/0x95a
> [ 0.043000] RSP <ffff88003f25fe18>
> [ 0.043012] ---[ end trace 9f1576f03a80bfa0 ]---
> [ 0.044018] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
>
> -cpu kvm64 works fine.
>
> Reproduces with upstream qemu a75143eda2ddf581b51e96c000974bcdfe2cbd10,
> as well as with qemu-kvm from Fedora 19.
>
> Tried recent git from Linus - it still has this problem.
>
> --
> MST

--
Eduardo
On Mon, Feb 03, 2014 at 10:58:28AM -0200, Eduardo Habkost wrote:
> Linux seems to be trying to read IA32_PERF_CAPABILITIES without checking the
> PDCM flag (CPUID[1].ECX[15]).
>
> I can't see why this wasn't crashing before, though. That code seems to be old.
>
> 	 * v2 and above have a perf capabilities MSR
> 	 */
> 	if (version > 1) {
> 		u64 capabilities;
>
> 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
> 		x86_pmu.intel_cap.capabilities = capabilities;
> 	}
>
> Where does the "v2 and above have a perf capabilities MSR" claim in the code
> come from?

Dunno, I'm pretty sure I wrote that code but I've no idea, other than
that's what actual hardware does.

I suppose the below would be correct.

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 0fa4f242f050..9407f61cdc1c 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2310,10 +2310,7 @@ __init int intel_pmu_init(void)
 	if (version > 1)
 		x86_pmu.num_counters_fixed = max((int)edx.split.num_counters_fixed, 3);
 
-	/*
-	 * v2 and above have a perf capabilities MSR
-	 */
-	if (version > 1) {
+	if (boot_cpu_has(X86_FEATURE_PDCM)) {
 		u64 capabilities;
 
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
On Mon, Feb 03, 2014 at 10:58:28AM -0200, Eduardo Habkost wrote:
> On Sun, Feb 02, 2014 at 10:59:30PM +0200, Michael S. Tsirkin wrote:
> > I observe this with guest 3.13 and host 3.13
> > when running with -cpu host on my laptop:
> >
> > [ 0.043000] Call Trace:
> > [ 0.043000] [<ffffffff81d0e873>] init_hw_perf_events+0x33/0x5cd
> > [ 0.043000] [<ffffffff81d0e840>] ? check_bugs+0x40/0x40
> > [ 0.043000] [<ffffffff8100030a>] do_one_initcall+0x13a/0x190
> > [ 0.043000] [<ffffffff81d15133>] ? native_smp_prepare_cpus+0x285/0x3ee
> > [ 0.043000] [<ffffffff81d068da>] kernel_init_freeable+0x136/0x298
> > [ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
> > [ 0.043000] [<ffffffff816834ee>] kernel_init+0xe/0x130
> > [ 0.043000] [<ffffffff8169422c>] ret_from_fork+0x7c/0xb0
> > [ 0.043000] [<ffffffff816834e0>] ? rest_init+0x80/0x80
> > [ 0.043000] Code: 0f 46 c2 41 83 e8 01 89 05 63 4c fd ff 7e 2e 44 89
> > d2 b8 03 00 00 00 b9 45 03 00 00 83 e2 1f 83 fa 02 0f 4f c2 89 05 6d 4b
> > fd ff <0f> 32 48 c1 e2 20 89 c0 48 09 c2 48 89 15 0b 4c fd ff e8 c6 d3
>
>    0:   0f 46 c2                cmovbe %edx,%eax
>    3:   41 83 e8 01             sub    $0x1,%r8d
>    7:   89 05 63 4c fd ff       mov    %eax,-0x2b39d(%rip)   # 0xfffffffffffd4c70
>    d:   7e 2e                   jle    0x3d
>    f:   44 89 d2                mov    %r10d,%edx
>   12:   b8 03 00 00 00          mov    $0x3,%eax
>   17:   b9 45 03 00 00          mov    $0x345,%ecx
>   1c:   83 e2 1f                and    $0x1f,%edx
>   1f:   83 fa 02                cmp    $0x2,%edx
>   22:   0f 4f c2                cmovg  %edx,%eax
>   25:   89 05 6d 4b fd ff       mov    %eax,-0x2b493(%rip)   # 0xfffffffffffd4b98
>   2b:*  0f 32                   rdmsr                        <-- trapping instruction
>   2d:   48 c1 e2 20             shl    $0x20,%rdx
>   31:   89 c0                   mov    %eax,%eax
>   33:   48 09 c2                or     %rax,%rdx
>   36:   48 89 15 0b 4c fd ff    mov    %rdx,-0x2b3f5(%rip)   # 0xfffffffffffd4c48
>   3d:   e8                      .byte 0xe8
>   3e:   c6                      (bad)
>   3f:   d3                      .byte 0xd3
>
> Linux seems to be trying to read IA32_PERF_CAPABILITIES without checking the
> PDCM flag (CPUID[1].ECX[15]).
>
> I can't see why this wasn't crashing before, though. That code seems to be old.
>
> 	 * v2 and above have a perf capabilities MSR
> 	 */
> 	if (version > 1) {
> 		u64 capabilities;
>
> 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
> 		x86_pmu.intel_cap.capabilities = capabilities;
> 	}
>
> Where does the "v2 and above have a perf capabilities MSR" claim in the code
> come from?

But why doesn't it crash on baremetal?
Probably baremetal simply returns 0 or something.
Let me try ..

> >
> > [ 0.043000] RIP [<ffffffff81d0f8c3>] intel_pmu_init+0x208/0x95a
> > [ 0.043000] RSP <ffff88003f25fe18>
> > [ 0.043012] ---[ end trace 9f1576f03a80bfa0 ]---
> > [ 0.044018] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> >
> > -cpu kvm64 works fine.
> >
> > Reproduces with upstream qemu a75143eda2ddf581b51e96c000974bcdfe2cbd10,
> > as well as with qemu-kvm from Fedora 19.
> >
> > Tried recent git from Linus - it still has this problem.
> >
> > --
> > MST
>
> --
> Eduardo
On 03/02/2014 15:06, Michael S. Tsirkin wrote:
>> Linux seems to be trying to read IA32_PERF_CAPABILITIES without checking the
>> PDCM flag (CPUID[1].ECX[15]).
>>
>> I can't see why this wasn't crashing before, though. That code seems to be old.
>>
>> * v2 and above have a perf capabilities MSR
>> */
>> if (version > 1) {
>> u64 capabilities;
>>
>> rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
>> x86_pmu.intel_cap.capabilities = capabilities;
>> }
>>
>> Where does the "v2 and above have a perf capabilities MSR" claim in the code
>> come from?
>
>
> But why doesn't it crash on baremetal?
> Probably baremetal simply returns 0 or something.
> Let me try ..
Because KVM doesn't implement the MSR, but your baremetal likely does.
Paolo
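
For readers following along, the PDCM test being discussed can be sketched in plain userspace C. This is only an illustration under stated assumptions, not kernel code: has_pdcm(), read_perf_capabilities(), the msr_reader callback and fake_rdmsr() are made-up names. Only the bit position (CPUID[1].ECX[15]) and the MSR index 0x345 (the value loaded into %ecx right before the trapping rdmsr) come from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PDCM flag: CPUID leaf 1, ECX bit 15, as cited in the thread. */
#define CPUID1_ECX_PDCM (UINT32_C(1) << 15)

/* MSR index seen in the disassembly (mov $0x345,%ecx). */
#define MSR_IA32_PERF_CAPABILITIES UINT32_C(0x345)

/* Hypothetical callback standing in for the privileged rdmsr instruction. */
typedef uint64_t (*msr_reader)(uint32_t msr);

/* True when the given CPUID[1].ECX value advertises the capabilities MSR. */
static inline bool has_pdcm(uint32_t cpuid1_ecx)
{
	return (cpuid1_ecx & CPUID1_ECX_PDCM) != 0;
}

/*
 * Read the capabilities MSR only when PDCM is set. When the flag is
 * clear, report 0 and never call the reader, so a missing MSR cannot
 * fault.
 */
static inline uint64_t read_perf_capabilities(uint32_t cpuid1_ecx,
					      msr_reader rd)
{
	if (!has_pdcm(cpuid1_ecx))
		return 0;
	return rd(MSR_IA32_PERF_CAPABILITIES);
}

/* Illustrative stub: returns a fixed sample value for the one MSR. */
static inline uint64_t fake_rdmsr(uint32_t msr)
{
	assert(msr == MSR_IA32_PERF_CAPABILITIES);
	return UINT64_C(0x31c3);
}
```

With this shape, read_perf_capabilities(0, fake_rdmsr) yields 0 without ever invoking the reader, which is the behaviour a guest without the flag would want.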
On Mon, Feb 03, 2014 at 04:06:01PM +0200, Michael S. Tsirkin wrote:
> On Mon, Feb 03, 2014 at 10:58:28AM -0200, Eduardo Habkost wrote:
> > Where does the "v2 and above have a perf capabilities MSR" claim in the code
> > come from?
>
>
> But why doesn't it crash on baremetal?
> Probably baremetal simply returns 0 or something.
> Let me try ..

The claim "v2 and above have FEATURE_PDCM" is in fact true for real
hardware.

If it didn't the rdmsr() would have generated an exception and we'd have
crashed just like your virtual thingy did.
On 03/02/2014 15:12, Peter Zijlstra wrote:
>> > But why doesn't it crash on baremetal?
>> > Probably baremetal simply returns 0 or something.
>> > Let me try ..
> The claim "v2 and above have FEATURE_PDCM" is in fact true for real
> hardware.
>
> If it didn't the rdmsr() would have generated an exception and we'd have
> crashed just like your virtual thingy did.
Right, and the virt thingy has no PEBS, so there is no correct value
that we could return from the MSR. That's why the CPUID bit is zero.
A strange game. The only winning move is not to play. How about a nice
game of chess?
Paolo
On Mon, Feb 03, 2014 at 03:19:18PM +0100, Paolo Bonzini wrote:
> On 03/02/2014 15:12, Peter Zijlstra wrote:
> >>> But why doesn't it crash on baremetal?
> >>> Probably baremetal simply returns 0 or something.
> >>> Let me try ..
> >The claim "v2 and above have FEATURE_PDCM" is in fact true for real
> >hardware.
> >
> >If it didn't the rdmsr() would have generated an exception and we'd have
> >crashed just like your virtual thingy did.
>
> Right, and the virt thingy has no PEBS, so there is no correct value that we
> could return from the MSR. That's why the CPUID bit is zero.
There's more than PEBS in there, there's also the LBR format (which you
obviously also don't have) and the full_width_write bit, which you also
don't have.
Returning 0 is a safe value. Seeing you don't have LBR, we don't look at
the LBR format fields, seeing you don't have PEBS, we don't look at
those fields either.
We don't appear to use the SMM_FREEZE bit at all, and 0 is in fact the
right value for full_width_write, since you lack the MSRs to support
that.
Anyway, it's easy for me to make future kernels do the right PDCM test,
probably easy to backport too (should apply with minimal trouble back a
fair number of releases).
You can also implement the MSR to simply return 0, which is a safe
value.
On Mon, Feb 03, 2014 at 03:26:42PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 03:19:18PM +0100, Paolo Bonzini wrote:
> > On 03/02/2014 15:12, Peter Zijlstra wrote:
> > >>> But why doesn't it crash on baremetal?
> > >>> Probably baremetal simply returns 0 or something.
> > >>> Let me try ..
> > >The claim "v2 and above have FEATURE_PDCM" is in fact true for real
> > >hardware.
> > >
> > >If it didn't the rdmsr() would have generated an exception and we'd have
> > >crashed just like your virtual thingy did.
> >
> > Right, and the virt thingy has no PEBS, so there is no correct value that we
> > could return from the MSR. That's why the CPUID bit is zero.
>
> There's more than PEBS in there, there's also the LBR format (which you
> obviously also don't have)
On that.. LBR is purely model based, what do you guys do with those
MSRs?
On Mon, Feb 03, 2014 at 03:07:33PM +0100, Paolo Bonzini wrote:
> On 03/02/2014 15:06, Michael S. Tsirkin wrote:
> >>Linux seems to be trying to read IA32_PERF_CAPABILITIES without checking the
> >>PDCM flag (CPUID[1].ECX[15]).
> >>
> >>I can't see why this wasn't crashing before, though. That code seems to be old.
> >>
> >> * v2 and above have a perf capabilities MSR
> >> */
> >> if (version > 1) {
> >> u64 capabilities;
> >>
> >> rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
> >> x86_pmu.intel_cap.capabilities = capabilities;
> >> }
> >>
> >>Where does the "v2 and above have a perf capabilities MSR" claim in the code
> >>come from?
> >
> >
> >But why doesn't it crash on baremetal?
> >Probably baremetal simply returns 0 or something.
> >Let me try ..
>
> Because KVM doesn't implement the MSR, but your baremetal likely does.
>
> Paolo
Yep. I get 31c3 on bare-metal.
So I suppose the claim is actually true, and ideally kvm should
emulate this instead of crashing the guest.
--
MST
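
As an aside, the bare-metal value above can be pulled apart with a few shifts and masks. The sketch below is an assumption-laden illustration rather than anything from the kernel tree: struct perf_caps, decode_perf_caps() and perf_caps_example() are hypothetical names, and the field layout (lbr_format in bits 0-5, pebs_trap bit 6, pebs_arch_reg bit 7, pebs_format bits 8-11, smm_freeze bit 12, full_width_write bit 13) is my reading of the Intel SDM and of the kernel's perf_capabilities union, so please verify it before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of IA32_PERF_CAPABILITIES (layout is an assumption). */
struct perf_caps {
	unsigned lbr_format;       /* bits 0-5  */
	unsigned pebs_trap;        /* bit 6     */
	unsigned pebs_arch_reg;    /* bit 7     */
	unsigned pebs_format;      /* bits 8-11 */
	unsigned smm_freeze;       /* bit 12    */
	unsigned full_width_write; /* bit 13    */
};

/* Split the raw MSR value into the fields named in the thread. */
static inline struct perf_caps decode_perf_caps(uint64_t raw)
{
	struct perf_caps c;

	c.lbr_format       = (unsigned)(raw & 0x3f);
	c.pebs_trap        = (unsigned)((raw >> 6) & 0x1);
	c.pebs_arch_reg    = (unsigned)((raw >> 7) & 0x1);
	c.pebs_format      = (unsigned)((raw >> 8) & 0xf);
	c.smm_freeze       = (unsigned)((raw >> 12) & 0x1);
	c.full_width_write = (unsigned)((raw >> 13) & 0x1);
	return c;
}

/* Usage example: the reported bare-metal value versus an all-zero MSR. */
static inline void perf_caps_example(void)
{
	struct perf_caps hw = decode_perf_caps(UINT64_C(0x31c3));
	struct perf_caps zero = decode_perf_caps(0);

	/* Bare metal advertises full-width writes ... */
	assert(hw.full_width_write == 1);
	/* ... while an all-zero MSR claims no LBR, PEBS or full-width writes. */
	assert(zero.lbr_format == 0 && zero.pebs_format == 0);
	assert(zero.full_width_write == 0);
}
```

Under that layout, an MSR that reads as 0 leaves every field clear, which fits the argument made in this thread that 0 is a safe value for a guest with no LBR, no PEBS and no full-width counter writes.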
On Mon, Feb 03, 2014 at 03:26:42PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 03, 2014 at 03:19:18PM +0100, Paolo Bonzini wrote:
> > On 03/02/2014 15:12, Peter Zijlstra wrote:
> > >>> But why doesn't it crash on baremetal?
> > >>> Probably baremetal simply returns 0 or something.
> > >>> Let me try ..
> > >The claim "v2 and above have FEATURE_PDCM" is in fact true for real
> > >hardware.
> > >
> > >If it didn't the rdmsr() would have generated an exception and we'd have
> > >crashed just like your virtual thingy did.
> >
> > Right, and the virt thingy has no PEBS, so there is no correct value that we
> > could return from the MSR. That's why the CPUID bit is zero.
>
> There's more than PEBS in there, there's also the LBR format (which you
> obviously also don't have) and the full_width_write bit, which you also
> don't have.
>
> Returning 0 is a safe value. Seeing you don't have LBR, we don't look at
> the LBR format fields, seeing you don't have PEBS, we don't look at
> those fields either.
>
> We don't appear to use the SMM_FREEZE bit at all, and 0 is in fact the
> right value for full_width_write, since you lack the MSRs to support
> that.
>
> Anyway, it's easy for me to make future kernels do the right PDCM test,
> probably easy to backport too (should apply with minimal trouble back a
> fair number of releases).
>
> You can also implement the MSR to simply return 0, which is a safe
> value.

OK, I'm testing the following now:

--->

Subject: [PATCH] kvm: emulate MSR_IA32_PERF_CAPABILITIES

guests expect that this does not crash if version > 1.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d004da..eaf5016 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2407,6 +2407,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 		/* CPU multiplier */
 		data |= (((uint64_t)4ULL) << 40);
 		break;
+	case MSR_IA32_PERF_CAPABILITIES:
+		data = 0;
+		break;
 	case MSR_EFER:
 		data = vcpu->arch.efer;
 		break;

--
MST
On 03/02/2014 15:28, Peter Zijlstra wrote:
>>> > > Right, and the virt thingy has no PEBS, so there is no correct value that we
>>> > > could return from the MSR. That's why the CPUID bit is zero.
>> >
>> > There's more than PEBS in there, there's also the LBR format (which you
>> > obviously also don't have)
> On that.. LBR is purely model based, what do you guys do with those
> MSRs?
On Intel nothing, the MSRs always return 0.
AMD did add LBR virtualization to its virtualization extensions, so on
AMD you can use LBR from a guest.
Paolo
Commit-ID:  c9b08884c9c98929ec2d8abafd78e89062d01ee7
Gitweb:     http://git.kernel.org/tip/c9b08884c9c98929ec2d8abafd78e89062d01ee7
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 3 Feb 2014 14:29:03 +0100
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Fri, 21 Feb 2014 22:09:01 +0100

perf/x86: Correctly use FEATURE_PDCM

The current code simply assumes Intel Arch PerfMon v2+ to have
the IA32_PERF_CAPABILITIES MSR; the SDM specifies that we should check
CPUID[1].ECX[15] (aka, FEATURE_PDCM) instead.

This was found by KVM which implements v2+ but didn't provide the
capabilities MSR. Change the code to DTRT; KVM will also implement the
MSR and return 0.

Cc: pbonzini@redhat.com
Reported-by: "Michael S. Tsirkin" <mst@redhat.com>
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140203132903.GI8874@twins.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 698ae77..aa333d9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2308,10 +2308,7 @@ __init int intel_pmu_init(void)
 	if (version > 1)
 		x86_pmu.num_counters_fixed = max((int)edx.split.num_counters_fixed, 3);
 
-	/*
-	 * v2 and above have a perf capabilities MSR
-	 */
-	if (version > 1) {
+	if (boot_cpu_has(X86_FEATURE_PDCM)) {
 		u64 capabilities;
 
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);