Re: KVM: unknown exit, hardware reason 31

From: Xiao Guangrong <guangrong.xiao@linux.intel.com>
To: Pavel Shirshov <ru.pchel@gmail.com>, kvm@vger.kernel.org
Subject: Re: KVM: unknown exit, hardware reason 31
Date: Mon, 27 Jul 2015 11:23:31 +0800
Message-ID: <55B5A433.1090107@linux.intel.com>
In-Reply-To: <CAG+TGLP_c-LoL+7hfGHUZSqgqQhCX6r_s2j9sEShsPuLiD4shw@mail.gmail.com>


I guess it happened in this scenario:


1. QEMU drops the mmio region
2. KVM invalidates all mmio sptes
3. The following interleaving happens:

         VCPU 0                          KVM        VCPU 1
     access the invalid mmio spte
                                    page reclaim
                                    zap shadow page

                                                 access the region that was MMIO before
                                                 set the spte to a normal RAM mapping

     mmio #PF
     check the spte and see it has become a normal RAM mapping !!!


The issue is caused by the fast invalidation of mmio sptes, which increases
the generation number instead of zapping the mmio sptes (SRCU can ensure that
the vcpu sees either an mmio spte or a zapped / being-zapped spte).
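
To make the generation-number idea concrete, here is a minimal toy sketch in
C. It is not KVM code: the toy_* names, the flag bit and the 8-bit generation
field are all assumptions made for illustration. It only shows that a fast
invalidation leaves the old mmio spte in place and relies on a generation
compare to reject it later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names and bit layout are illustrative, not KVM's. */
#define TOY_MMIO_FLAG  (1ull << 63)   /* marks an entry as an mmio spte */
#define TOY_GEN_SHIFT  48
#define TOY_GEN_MASK   0xffull        /* 8-bit generation tag */

static uint64_t toy_current_gen;      /* bumped on every fast invalidation */

/* Build an mmio spte tagged with the generation current at creation time. */
static uint64_t toy_make_mmio_spte(uint64_t gfn)
{
	assert(gfn < (1ull << TOY_GEN_SHIFT));
	return TOY_MMIO_FLAG |
	       ((toy_current_gen & TOY_GEN_MASK) << TOY_GEN_SHIFT) | gfn;
}

static bool toy_is_mmio_spte(uint64_t spte)
{
	return (spte & TOY_MMIO_FLAG) != 0;
}

/* An mmio spte is usable only if its tag matches the current generation. */
static bool toy_mmio_spte_is_current(uint64_t spte)
{
	uint64_t gen = (spte >> TOY_GEN_SHIFT) & TOY_GEN_MASK;

	return toy_is_mmio_spte(spte) &&
	       gen == (toy_current_gen & TOY_GEN_MASK);
}

/* Fast invalidation: no page-table walk, no zapping, just a new generation. */
static void toy_fast_invalidate_mmio_sptes(void)
{
	toy_current_gen++;
}
```

After toy_fast_invalidate_mmio_sptes() an spte built earlier still passes
toy_is_mmio_spte() but fails toy_mmio_spte_is_current(). So a stale mmio
spte stays in the page table until something zaps or replaces it, and that
is the window the race above goes through.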

The simple fix is to just drop check_direct_spte_mmio_pf() and let the VCPU
access the address again, as follows:

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4417146..299a5da 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3299,21 +3299,6 @@ static bool quickly_check_mmio_pf(struct kvm_vcpu *vcpu, u64 addr, bool direct)
         return vcpu_match_mmio_gva(vcpu, addr);
  }

-
-/*
- * On direct hosts, the last spte is only allows two states
- * for mmio page fault:
- *   - It is the mmio spte
- *   - It is zapped or it is being zapped.
- *
- * This function completely checks the spte when the last spte
- * is not the mmio spte.
- */
-static bool check_direct_spte_mmio_pf(u64 spte)
-{
-       return __check_direct_spte_mmio_pf(spte);
-}
-
  static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
  {
         struct kvm_shadow_walk_iterator iterator;
@@ -3356,13 +3341,6 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct)
         }

         /*
-        * It's ok if the gva is remapped by other cpus on shadow guest,
-        * it's a BUG if the gfn is not a mmio page.
-        */
-       if (direct && !check_direct_spte_mmio_pf(spte))
-               return RET_MMIO_PF_BUG;
-
-       /*
          * If the page table is zapped by other cpus, let CPU fault again on
          * the address.
          */
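
The effect of the removed check can be sketched with a small toy model in C.
This is an assumption-laden illustration, not the kernel function: the enum
values and toy_handle_mmio_pf() are hypothetical names, and the result codes
only loosely follow the RET_MMIO_PF_* naming seen in the diff. The
strict_check parameter stands in for "the kernel before this patch".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names mirror the patch loosely, not KVM's API. */
enum toy_spte_state {
	TOY_SPTE_ZAPPED,       /* not present: zapped or being zapped */
	TOY_SPTE_MMIO_CURRENT, /* mmio spte with the current generation */
	TOY_SPTE_MMIO_STALE,   /* mmio spte left over from an old generation */
	TOY_SPTE_RAM,          /* ordinary RAM mapping */
};

enum toy_mmio_pf_result {
	TOY_PF_EMULATE,  /* emulate the mmio access */
	TOY_PF_INVALID,  /* stale mmio spte: use the regular fault path */
	TOY_PF_RETRY,    /* re-enter the guest and let it fault again */
	TOY_PF_BUG,      /* treated as an impossible state */
};

static enum toy_mmio_pf_result
toy_handle_mmio_pf(enum toy_spte_state seen, bool direct, bool strict_check)
{
	if (seen == TOY_SPTE_MMIO_CURRENT)
		return TOY_PF_EMULATE;
	if (seen == TOY_SPTE_MMIO_STALE)
		return TOY_PF_INVALID;

	/*
	 * The check the patch removes: on direct hosts, anything that is
	 * neither an mmio spte nor a zapped spte was reported as a bug.
	 */
	if (strict_check && direct && seen != TOY_SPTE_ZAPPED)
		return TOY_PF_BUG;

	return TOY_PF_RETRY;
}
```

In the race described earlier, VCPU 0 ends up looking at TOY_SPTE_RAM on a
direct host. With strict_check set the model returns TOY_PF_BUG; with it
cleared the model returns TOY_PF_RETRY and the VCPU simply accesses the
address again.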

Pavel, could you please check if it works for you?

I will think the case through fully and post a proper fix...

On 07/25/2015 03:25 AM, Pavel Shirshov wrote:
> Hello,
>
> I'm running a lot of identical VMs under KVM. Sometimes (about once per
> 2000-3000 runs) I get the following:
>
> 1. VM is paused in libvirt. It can't be just resumed. I can just reset
> it and resume.
> 2. In VM log file I see following: "KVM: unknown exit, hardware reason
> 31" with a CPU dump.
> 3. In dmesg I see following:
> [84245.284948] EPT: Misconfiguration.
> [84245.285056] EPT: GPA: 0xfeda848
> [84245.285154] ept_misconfig_inspect_spte: spte 0x5eaef50107 level 4
> [84245.285344] ept_misconfig_inspect_spte: spte 0x5f5fadc107 level 3
> [84245.285532] ept_misconfig_inspect_spte: spte 0x5141d18107 level 2
> [84245.285723] ept_misconfig_inspect_spte: spte 0x52e40dad77 level 1
>
> OS. 3.16.0-44-generic #59~14.04.1-Ubuntu SMP
> QEMU: QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.14),
> Copyright (c) 2003-2008 Fabrice Bellard
>
> Is it a Linux KVM bug or a CPU bug? How can I fix that?
>
> I can reproduce the bug in one-two days. Is it possible to enable
> deeper debug for the issue?
>
> Thanks
>