Subject: Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
From: Maxim Levitsky
To: Jim Mattson
Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
    Joerg Roedel, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    Oliver Upton, Peter Shier
Date: Thu, 30 Jun 2022 11:24:33 +0300
References: <20220614204730.3359543-1-seanjc@google.com>
    <7e05e0befa13af05f1e5f0fd8658bc4e7bdf764f.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2022-06-29 at 08:53 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > honoring L1's exception intercepts wants when handling an exception that
> > > occurs during delivery of a different exception. E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > >
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same. The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > >
> > > Everything leading up to that are related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests.
> > >
> > > v2:
> > >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> > >     overhaul.
> > >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> > >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> > >
> > > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> > >
> > > Sean Christopherson (21):
> > >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> > >     "exit"
> > >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> > >   KVM: x86: Don't check for code breakpoints when emulating on exception
> > >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> > >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> > >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> > >     DR7.GD=1)
> > >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> > >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> > >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> > >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> > >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> > >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> > >   KVM: x86: Formalize blocking of nested pending exceptions
> > >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> > >   KVM: x86: Hoist nested event checks above event injection logic
> > >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> > >     VM-Exit
> > >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> > >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> > >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> > >     behavior
> > >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> > >   KVM: selftests: Add an x86-only test to verify nested exception
> > >     queueing
> > >
> > >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> > >  arch/x86/include/asm/kvm_host.h               |  35 +-
> > >  arch/x86/kvm/emulate.c                        |   3 +-
> > >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> > >  arch/x86/kvm/svm/svm.c                        |  18 +-
> > >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> > >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> > >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> > >  arch/x86/kvm/x86.c                            | 404 +++++++++++------
> > >  arch/x86/kvm/x86.h                            |  11 +-
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> > >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> > >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> > >  15 files changed, 886 insertions(+), 418 deletions(-)
> > >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> > >
> > >
> > > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
> >
> > Hi Sean and everyone!
> >
> >
> > Before I continue reviewing the patch series, I would like you to check if
> > I understand the monitor trap/pending debug exception/event injection
> > logic on VMX correctly. I was looking at the spec for several hours and I still have more
> > questions than answers about it.
> >
> > So let me state what I understand:
> >
> > 1. Event injection (aka eventinj in SVM terms):
> >
> > (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
> >
> > If I understand correctly, all event injection types, just like on SVM, just inject,
> > and never create something pending and/or drop the injection if the event is not allowed
> > (like if EFLAGS.IF is 0). (VMX might have some checks that could fail VM entry,
> > for example if you try to inject type 0 (hardware interrupt) while EFLAGS.IF is 0;
> > I haven't checked this.)
> >
> > All event injections happen right away, don't deliver any payload (like DR6), etc.
> >
> > Injection types 4/5/6 do the same as injection types 0/2/3, but in addition to that,
> > types 4/6 do a DPL check in the IDT, and these types can also advance the RIP prior
> > to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN, to be consistent
> > with cases when these trap-like events are intercepted, where the interception happens
> > at the start of the instruction despite the exceptions being trap-like.
> >
> >
> > 2. #DB is the only trap-like exception that can be pending for one more instruction
> > if the MOV SS shadow is on (any other cases?).
> > (AMD just ignores the whole thing, rightfully)
> >
> > That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
> > I understand that it will be written by the CPU in case we have a VM exit at the moment
> > where a #DB is already pending but not yet delivered.
> >
> > That field can also be (sadly) used to "inject" a #DB into the guest, if the hypervisor sets it,
> > and this #DB will actually update DR6 and such, and might be delayed/lost.
> >
> >
> > 3. Facts about MTF:
> >
> > * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
> > instruction', and is enabled in the primary execution controls.
> >
> > * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
> > and that has no connection to the 'feature', although usually this injection will be useful
> > when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
> >
> > * An MTF event can be lost if a higher-priority VM exit happens; this is why the SDM talks about a 'pending MTF',
> > which means that the MTF VM exit should happen unless something else prevents it and/or a higher-priority VM exit
> > overrides it.
> >
> > * An MTF event is raised (when the primary execution controls bit is enabled) in these cases:
> >
> > - After an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, has finished updating the guest state
> > (that is, the stack was switched, stuff was pushed to the new exception stack, RIP was updated to the handler).
> > I am not 100% sure about this, but this seems to be what the PRM implies:
> >
> > "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> > 26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> > VM entry."
> >
> > - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
> > then once again MTF will happen on the first instruction of the exception/interrupt handler:
> >
> > "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> > (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> > on the instruction boundary following delivery of the event (or any nested exception)."
> >
> > That means that #DB has higher priority than MTF, but it is not specified whether that is a fault-like or a trap-like #DB.
> >
> > - If an instruction causes an exception, once again, MTF will happen on the first instruction of the exception handler.
> >
> > - Otherwise, after an instruction (or REP iteration) retires.
> >
> >
> > If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears!
>
> Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
> Peter Shier. (Just in case anyone is interested in *truly*
> virtualizing the feature under KVM, rather than just setting a
> VM-execution control bit in vmcs02 and calling it done.)
>
> https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml

Sadly, I can't access this document either :(

Best regards,
	Maxim Levitsky
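
For anyone following the event injection discussion quoted above, here is a minimal sketch (not taken from the series or from KVM) of how the VM-entry interruption-information field (VM_ENTRY_INTR_INFO_FIELD) is laid out according to the SDM: vector in bits 7:0, event type in bits 10:8, "deliver error code" in bit 11, valid in bit 31. All names below (`evt_encode()`, the `EVT_*` constants, and so on) are hypothetical and chosen only for illustration; the kernel uses its own macros for these bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Event types for bits 10:8 of the field (type 1 is reserved). */
enum vmx_event_type {
	EVT_EXT_INTR       = 0, /* external (hardware) interrupt */
	EVT_NMI            = 2,
	EVT_HW_EXCEPTION   = 3, /* e.g. #PF, #GP, #DB */
	EVT_SOFT_INTR      = 4, /* INT n */
	EVT_PRIV_SW_EXC    = 5, /* INT1 (ICEBP) */
	EVT_SOFT_EXCEPTION = 6, /* INT3, INTO */
	EVT_OTHER          = 7, /* vector 0 means "pending MTF VM exit" */
};

#define EVT_VECTOR_MASK 0xffu
#define EVT_TYPE_SHIFT  8
#define EVT_TYPE_MASK   (7u << EVT_TYPE_SHIFT)
#define EVT_DELIVER_EC  (1u << 11)
#define EVT_VALID       (1u << 31)

/* Build the 32-bit value a hypervisor would write before VM entry. */
static inline uint32_t evt_encode(uint8_t vector, enum vmx_event_type type,
				  bool has_error_code)
{
	assert(type <= EVT_OTHER && type != 1);

	uint32_t info = EVT_VALID | ((uint32_t)type << EVT_TYPE_SHIFT) | vector;

	if (has_error_code)
		info |= EVT_DELIVER_EC;
	return info;
}

static inline unsigned int evt_type(uint32_t info)
{
	return (info & EVT_TYPE_MASK) >> EVT_TYPE_SHIFT;
}

/* Types 4/5/6 are the ones that consume VM_ENTRY_INSTRUCTION_LEN. */
static inline bool evt_uses_instruction_len(uint32_t info)
{
	unsigned int type = evt_type(info);

	return type >= EVT_SOFT_INTR && type <= EVT_SOFT_EXCEPTION;
}

/* Type 7 with vector 0 injects a pending MTF VM exit. */
static inline bool evt_is_mtf(uint32_t info)
{
	return (info & EVT_VALID) && evt_type(info) == EVT_OTHER &&
	       (info & EVT_VECTOR_MASK) == 0;
}
```

With this sketch, injecting a #PF with an error code would be `evt_encode(14, EVT_HW_EXCEPTION, true)`, and the "MTF as an event" case from point 3 would be `evt_encode(0, EVT_OTHER, false)`.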