kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Like Xu <like.xu.linux@gmail.com>
To: Sean Christopherson <seanjc@google.com>,
	Zhu Lingshan <lingshan.zhu@intel.com>
Cc: peterz@infradead.org, pbonzini@redhat.com, bp@alien8.de,
	vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com,
	joro@8bytes.org, kan.liang@linux.intel.com, ak@linux.intel.com,
	wei.w.wang@intel.com, eranian@google.com,
	liuxiangdong5@huawei.com, linux-kernel@vger.kernel.org,
	x86@kernel.org, kvm@vger.kernel.org, boris.ostrvsky@oracle.com,
	Like Xu <like.xu@linux.intel.com>, Will Deacon <will@kernel.org>,
	Marc Zyngier <maz@kernel.org>, Guo Ren <guoren@kernel.org>,
	Nick Hu <nickhu@andestech.com>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu, linux-csky@vger.kernel.org,
	linux-riscv@lists.infradead.org, xen-devel@lists.xenproject.org
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
Date: Fri, 27 Aug 2021 14:31:43 +0800	[thread overview]
Message-ID: <2f3d9676-3f49-3963-20d5-7aaf81ce996e@gmail.com> (raw)
In-Reply-To: <YSfykbECnC6J02Yk@google.com>

On 27/8/2021 3:59 am, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
>         foundation that I would like to fix.

Obviously, this patch is not closely related to the guest PEBS feature enabling,
and we can certainly put this issue in another discussion thread [1].

[1] https://lore.kernel.org/kvm/20210827005718.585190-1-seanjc@google.com/

> 
> On Fri, Aug 06, 2021, Zhu Lingshan wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 464917096e73..e466fc8176e1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>>    */
>>   struct perf_guest_info_callbacks *perf_guest_cbs;
>>   
>> +/* explicitly use __weak to fix duplicate symbol error */
>> +void __weak arch_perf_update_guest_cbs(void)
>> +{
>> +}
>> +
>>   int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>>   {
>> +	if (WARN_ON_ONCE(perf_guest_cbs))
>> +		return -EBUSY;
>> +
>>   	perf_guest_cbs = cbs;
>> +	arch_perf_update_guest_cbs();
> 
> This is horribly broken, it fails to cleanup the static calls when KVM unregisters
> the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded.
> The explosion doesn't happen until 'kvm' is unloaded because the functions are
> implemented in 'kvm', i.e. the use-after-free is deferred a bit.
> 
>    BUG: unable to handle page fault for address: ffffffffa011bb90
>    #PF: supervisor instruction fetch in kernel mode
>    #PF: error_code(0x0010) - not-present page
>    PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
>    Oops: 0010 [#1] PREEMPT SMP
>    CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    RIP: 0010:0xffffffffa011bb90
>    Code: Unable to access opcode bytes at RIP 0xffffffffa011bb66.
>    Call Trace:
>     <NMI>
>     ? perf_misc_flags+0xe/0x50
>     ? perf_prepare_sample+0x53/0x6b0
>     ? perf_event_output_forward+0x67/0x160
>     ? kvm_clock_read+0x14/0x30
>     ? kvm_sched_clock_read+0x5/0x10
>     ? sched_clock_cpu+0xd/0xd0
>     ? __perf_event_overflow+0x52/0xf0
>     ? handle_pmi_common+0x1f2/0x2d0
>     ? __flush_tlb_all+0x30/0x30
>     ? intel_pmu_handle_irq+0xcf/0x410
>     ? nmi_handle+0x5/0x260
>     ? perf_event_nmi_handler+0x28/0x50
>     ? nmi_handle+0xc7/0x260
>     ? lock_release+0x2b0/0x2b0
>     ? default_do_nmi+0x6b/0x170
>     ? exc_nmi+0x103/0x130
>     ? end_repeat_nmi+0x16/0x1f
>     ? lock_release+0x2b0/0x2b0
>     ? lock_release+0x2b0/0x2b0
>     ? lock_release+0x2b0/0x2b0
>     </NMI>
>    Modules linked in: irqbypass [last unloaded: kvm]
> 
> Even more fun, the existing perf_guest_cbs framework is also broken, though it's
> much harder to get it to fail, and probably impossible to get it to fail without
> some help.  The issue is that perf_guest_cbs is global, which means that it can
> be nullified by KVM (during module unload) while the callbacks are being accessed
> by a PMI handler on a different CPU.
> 
> The bug has escaped notice because all dererfences of perf_guest_cbs follow the
> same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
> compiler never reload perf_guest_cbs in this sequence.  The compiler does reload
> perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but
> guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs,
> KVM has to completely exit the guest and teardown down all VMs before it can be
> unloaded.
> 
> But with a help, e.g. RAED_ONCE(perf_guest_cbs), unloading kvm_intel can trigger
> a NULL pointer derference, e.g. this tweak
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
>   {
>          int misc = 0;
> 
> -       if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> +       if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
>                  if (perf_guest_cbs->is_user_mode())
>                          misc |= PERF_RECORD_MISC_GUEST_USER;
>                  else
> 
> 
> while spamming module load/unload leads to:
> 
>    BUG: kernel NULL pointer dereference, address: 0000000000000000
>    #PF: supervisor read access in kernel mode
>    #PF: error_code(0x0000) - not-present page
>    PGD 0 P4D 0
>    Oops: 0000 [#1] PREEMPT SMP
>    CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    RIP: 0010:perf_misc_flags+0x1c/0x70
>    Call Trace:
>     perf_prepare_sample+0x53/0x6b0
>     perf_event_output_forward+0x67/0x160
>     __perf_event_overflow+0x52/0xf0
>     handle_pmi_common+0x207/0x300
>     intel_pmu_handle_irq+0xcf/0x410
>     perf_event_nmi_handler+0x28/0x50
>     nmi_handle+0xc7/0x260
>     default_do_nmi+0x6b/0x170
>     exc_nmi+0x103/0x130
>     asm_exc_nmi+0x76/0xbf
> 
> 
> The good news is that I have a series that should fix both the existing NULL pointer
> bug and mostly obviate the need for static calls.  The bad news is that my approach,
> making perf_guest_cbs per-CPU, likely complicates turning these into static calls,
> though I'm guessing it's still a solvable problem.
> 
> Tangentially related, IMO we should make architectures opt-in to getting
> perf_guest_cbs and nuke all of the code in the below files.  Except for arm,
> which recently lost KVM support, it's all a bunch of useless copy-paste code that
> serves no purpose and just complicates cleanups like this.
> 
>>   arch/arm/kernel/perf_callchain.c   | 16 +++++++-----
>>   arch/csky/kernel/perf_callchain.c  |  4 +--
>>   arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
>>   arch/riscv/kernel/perf_callchain.c |  4 +--

  reply	other threads:[~2021-08-27  6:32 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-06 13:37 [PATCH V10 00/18] KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks Zhu Lingshan
2021-08-26 19:59   ` Sean Christopherson
2021-08-27  6:31     ` Like Xu [this message]
2021-09-15  1:19     ` Zhu, Lingshan
2021-09-21 23:22       ` Sean Christopherson
2021-08-27 17:23   ` Sean Christopherson
2021-08-06 13:37 ` [PATCH V10 02/18] perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 03/18] perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 04/18] perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled Zhu Lingshan
2021-11-07 10:14   ` Liuxiangdong
2021-11-08  3:06     ` Like Xu
2021-11-08  4:07       ` Liuxiangdong
2021-11-08  4:11         ` Like Xu
2021-11-08  8:27           ` Liuxiangdong
2021-11-08  8:44             ` Like Xu
2021-11-08 10:06               ` Liuxiangdong
2021-08-06 13:37 ` [PATCH V10 06/18] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 07/18] x86/perf/core: Add pebs_capable to store valid PEBS_COUNTER_MASK value Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 08/18] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 09/18] KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 10/18] KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 11/18] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 12/18] KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 13/18] KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 14/18] KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h Zhu Lingshan
2021-08-06 13:37 ` [PATCH V10 15/18] KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations Zhu Lingshan
2021-08-06 13:38 ` [PATCH V10 16/18] KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability Zhu Lingshan
2021-08-06 13:38 ` [PATCH V10 17/18] KVM: x86/cpuid: Refactor host/guest CPU model consistency check Zhu Lingshan
2021-08-06 13:38 ` [PATCH V10 18/18] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64 Zhu Lingshan
2021-08-18  5:27 ` [PATCH V10 00/18] KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS Zhu, Lingshan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2f3d9676-3f49-3963-20d5-7aaf81ce996e@gmail.com \
    --to=like.xu.linux@gmail.com \
    --cc=ak@linux.intel.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=boris.ostrvsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=eranian@google.com \
    --cc=guoren@kernel.org \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=kan.liang@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=like.xu@linux.intel.com \
    --cc=lingshan.zhu@intel.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-csky@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=liuxiangdong5@huawei.com \
    --cc=maz@kernel.org \
    --cc=nickhu@andestech.com \
    --cc=paul.walmsley@sifive.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=seanjc@google.com \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=wei.w.wang@intel.com \
    --cc=will@kernel.org \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).