From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BCE52C04A6A for ; Tue, 8 Aug 2023 20:29:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231343AbjHHU3G (ORCPT ); Tue, 8 Aug 2023 16:29:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50234 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230401AbjHHU2z (ORCPT ); Tue, 8 Aug 2023 16:28:55 -0400 Received: from mail-pg1-x549.google.com (mail-pg1-x549.google.com [IPv6:2607:f8b0:4864:20::549]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CEB20157923 for ; Tue, 8 Aug 2023 10:07:37 -0700 (PDT) Received: by mail-pg1-x549.google.com with SMTP id 41be03b00d2f7-56385c43eaeso36595a12.1 for ; Tue, 08 Aug 2023 10:07:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691514457; x=1692119257; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=FYydZ/WC/X7PBz3rGSs121PfBH3fb5ehFOJQFZBx3p4=; b=GV0T5JGICTxBHQToBy1PfSyoy2i4bM1zH73FdGtuKSDbeUUFxy0mudpUdMA9EPwcN/ j9JXJzSqXP5zsg/86aZbjj3WK6wE6yNNwTycSPQEiglcligbGzAAlM+yBOTID7rqdjfj 6Aji/NflE6sHuE7Q7L0y/h/uyQwprnXyF7UHP7SQXMUAqy0cM5MagWJ74MsmFB9uNvz4 Fhzeb/hC7QxHmzNY8sct7/oI+RxQxXyGdMlFt1HKVaf4LrBL8E5QrHqUnWUbGvMbH5N1 orjBrAWkw9u53y0m5n5LsKUHYYdC/jL1/eWH2X30bvMRM3MC3nrYUuz3kYjwwZb6Kd5V fCJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691514457; x=1692119257; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=FYydZ/WC/X7PBz3rGSs121PfBH3fb5ehFOJQFZBx3p4=; b=TnxMtC+tQOvcSPJr5TAqEKcjBEY5sO5mb8iWSDg4HCkuZSq/nub8zg2eaD9l6mjGCo 1caFWeTH39P+tpw3qBOdStIoD3eniIHJ29djCxwHT2R3MtipVvU4THk7KmZP48QIdeRP 0ZuXs8UJtHW1kbWc9WKxkQs4y9A3mjpjOH04SSbo2DbygteaMn7nhk5K/fIHjZIyZGwq +tVC+Taijre9NpDDcsjDJFy/NwgDDaNpCXuD+fwXkLUHl9D3OhU8ikOAbj21+yFulP5N 2IpbigwOwx6keaEeialbIvjiaRjXqqkMhJCslt3K1uePQDQvzcQEz3G47tQ2OrHDITvp 029Q== X-Gm-Message-State: AOJu0YwqnnI7JiqxqBRaNxU/gussNMkGCfpm575DKayXlKxUGLl/8Z2D Y8mUxbhtagRkMCu5LU1/yV+IrxwUyrw= X-Google-Smtp-Source: AGHT+IHKPz9NN9UwaUpEVU90XfLOlmfjcXFMG76C/HmVH7eT9tRDEVaJTacVs/21ngorplmfMSbyl1i9APo= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a63:5264:0:b0:564:4a9e:1b80 with SMTP id s36-20020a635264000000b005644a9e1b80mr76736pgl.6.1691514457246; Tue, 08 Aug 2023 10:07:37 -0700 (PDT) Date: Tue, 8 Aug 2023 10:07:35 -0700 In-Reply-To: Mime-Version: 1.0 References: <20230721143407.2654728-1-amaan.cheval@gmail.com> Message-ID: Subject: Re: Deadlock due to EPT_VIOLATION From: Sean Christopherson To: Amaan Cheval Cc: brak@gameservers.com, kvm@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Tue, Aug 08, 2023, Amaan Cheval wrote: > Hey Sean, > > > If NUMA balancing is going nuclear and constantly zapping PTEs, the resulting > > mmu_notifier events could theoretically stall a vCPU indefinitely. The reason I > > dislike NUMA balancing is that it's all too easy to end up with subtle bugs > > and/or misconfigured setups where the NUMA balancing logic zaps PTEs/SPTEs without > > actuablly being able to move the page in the end, i.e. it's (IMO) too easy for > > NUMA balancing to get false positives when determining whether or not to try and > > migrate a page. > > What are some situations where it might not be able to move the page in the end? There's a pretty big list, see the "failure" paths of do_numa_page() and migrate_misplaced_page(). > > That said, it's definitely very unexpected that NUMA balancing would be zapping > > SPTEs to the point where a vCPU can't make forward progress. It's theoretically > > possible that that's what's happening, but quite unlikely, especially since it > > sounds like you're seeing issues even with NUMA balancing disabled. > > Yep, we're definitely seeing the issue occur even with numa_balancing > enabled, but the likelihood of it occurring has significantly dropped since > we've disabled numa_balancing. So the fact that this occurs with NUMA balancing disabled means the problem likely isn't with NUMA balancing itself. NUMA balancing probably exposes some underlying issue due to it generating a near-constant stream of mmu_notifier invalidation. > > More likely is that there is a bug somewhere that results in the mmu_notifier > > event refcount staying incorrectly eleveated, but that type of bug shouldn't follow > > the VM across a live migration... > > Good news! We managed to live migrate a guest and that did "fix it". ... > A colleague also modified a host kernel with KFI (Kernel Function > Instrumentation) and wrote a kernel module to intercept the vmexit handler, > handle_ept_violation, and does an EPT walk for each pfn, compared against > /proc/iomem. > > Assuming the EPT walking code is correct, we see this surprising result of a > PDPTE's pfn=0: Not surprising. The entire EPTE is zero, i.e. has been zapped by KVM. This is exactly what is expected. > Does this seem to indicate an mmu_notifier refcount issue to you, given that > migration did fix it? Any way to verify? It doesn't move the needle either way, it just confirms what we already know: the vCPU is repeatedly taking !PRESENT faults. The unexpected part is that KVM never "fixes" the fault and never outright fails. > We haven't found any guests with `softlockup_panic=1` yet, and since we can't > reproduce the issue on command ourselves yet, we might have to wait a bit - but > I imagine that the fact that live migration "fixed" the locked up guest confirms > that the other guests that didn't get "fixed" were likely softlocked from the > CPU stalling? Yes. > If you have any suggestions on how modifying the host kernel (and then migrating > a locked up guest to it) or eBPF programs that might help illuminate the issue > further, let me know! > > Thanks for all your help so far! Since it sounds like you can test with a custom kernel, try running with this patch and then enable the kvm_page_fault tracepoint when a vCPU gets stuck. The below expands said tracepoint to capture information about mmu_notifiers and memslots generation. With luck, it will reveal a smoking gun. --- arch/x86/kvm/mmu/mmu.c | 10 ---------- arch/x86/kvm/mmu/mmu_internal.h | 2 ++ arch/x86/kvm/mmu/tdp_mmu.h | 10 ++++++++++ arch/x86/kvm/trace.h | 28 ++++++++++++++++++++++++++-- 4 files changed, 38 insertions(+), 12 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 9e4cd8b4a202..122bfc884293 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2006,16 +2006,6 @@ static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm, return true; } -static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) -{ - if (sp->role.invalid) - return true; - - /* TDP MMU pages do not use the MMU generation. */ - return !is_tdp_mmu_page(sp) && - unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); -} - struct mmu_page_path { struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL]; unsigned int idx[PT64_ROOT_MAX_LEVEL]; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index f1ef670058e5..cf7ba0abaa8f 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -6,6 +6,8 @@ #include #include +#include "mmu.h" + #ifdef CONFIG_KVM_PROVE_MMU #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x) #else diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h index 0a63b1afabd3..a0d7c8acf78f 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.h +++ b/arch/x86/kvm/mmu/tdp_mmu.h @@ -76,4 +76,14 @@ static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; } #endif +static inline bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp) +{ + if (sp->role.invalid) + return true; + + /* TDP MMU pages do not use the MMU generation. */ + return !is_tdp_mmu_page(sp) && + unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen); +} + #endif /* __KVM_X86_MMU_TDP_MMU_H */ diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 83843379813e..ff4a384ab03a 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -8,6 +8,8 @@ #include #include +#include "mmu/tdp_mmu.h" + #undef TRACE_SYSTEM #define TRACE_SYSTEM kvm @@ -405,6 +407,13 @@ TRACE_EVENT(kvm_page_fault, __field( unsigned long, guest_rip ) __field( u64, fault_address ) __field( u64, error_code ) + __field( unsigned long, mmu_invalidate_seq) + __field( long, mmu_invalidate_in_progress) + __field( unsigned long, mmu_invalidate_range_start) + __field( unsigned long, mmu_invalidate_range_end) + __field( bool, root_is_valid) + __field( bool, root_has_sp) + __field( bool, root_is_obsolete) ), TP_fast_assign( @@ -412,11 +421,26 @@ TRACE_EVENT(kvm_page_fault, __entry->guest_rip = kvm_rip_read(vcpu); __entry->fault_address = fault_address; __entry->error_code = error_code; + __entry->mmu_invalidate_seq = vcpu->kvm->mmu_invalidate_seq; + __entry->mmu_invalidate_in_progress = vcpu->kvm->mmu_invalidate_in_progress; + __entry->mmu_invalidate_range_start = vcpu->kvm->mmu_invalidate_range_start; + __entry->mmu_invalidate_range_end = vcpu->kvm->mmu_invalidate_range_end; + __entry->root_is_valid = VALID_PAGE(vcpu->arch.mmu->root.hpa); + __entry->root_has_sp = VALID_PAGE(vcpu->arch.mmu->root.hpa) && + to_shadow_page(vcpu->arch.mmu->root.hpa); + __entry->root_is_obsolete = VALID_PAGE(vcpu->arch.mmu->root.hpa) && + to_shadow_page(vcpu->arch.mmu->root.hpa) && + is_obsolete_sp(vcpu->kvm, to_shadow_page(vcpu->arch.mmu->root.hpa)); ), - TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx", + TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx, seq = 0x%lx, in_prog = 0x%lx, start = 0x%lx, end = 0x%lx, root = %s", __entry->vcpu_id, __entry->guest_rip, - __entry->fault_address, __entry->error_code) + __entry->fault_address, __entry->error_code, + __entry->mmu_invalidate_seq, __entry->mmu_invalidate_in_progress, + __entry->mmu_invalidate_range_start, __entry->mmu_invalidate_range_end, + !__entry->root_is_valid ? "invalid" : + !__entry->root_has_sp ? "no shadow page" : + __entry->root_is_obsolete ? "obsolete" : "fresh") ); /* base-commit: 240f736891887939571854bd6d734b6c9291f22e --