From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-23.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C25D1C433F5 for ; Thu, 16 Sep 2021 16:00:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A71E56120E for ; Thu, 16 Sep 2021 16:00:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240417AbhIPQBo (ORCPT ); Thu, 16 Sep 2021 12:01:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41818 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240185AbhIPQBn (ORCPT ); Thu, 16 Sep 2021 12:01:43 -0400 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C6489C061574 for ; Thu, 16 Sep 2021 09:00:22 -0700 (PDT) Received: by mail-pj1-x102c.google.com with SMTP id f3-20020a17090a638300b00199097ddf1aso7912983pjj.0 for ; Thu, 16 Sep 2021 09:00:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=qY0iS8u2LJMGt4Ov6dPslL6Xeqan2bC0+CFm4DC6DyA=; b=EGnLDp6iMIbuxmJgrxlgyPdXStg4wUvMU3ujBKbXuvsAXhESpChnmL4Vw/TN4vovg3 lvGPL5avgzs9VcIFNCzyeY2mCMmw+QWuOvH8kslY0VD5n3RA18dtoh6GQaiwb+Ludaet 0FS/eQAWNY4+YXz+tThYq/Nz5SLkazkelI2GgqxfpRKkSw1h/JC/S/PUdACom7DaMNvG aooa3JQHpsz9s4G2hyFeF9RGVTJFlELpoZRWZCAtJjjykoEh9RUIUg2723L9EWW5uY8r GIAl0Ww03i6bf7Rc2Dq58t/td641ijBX5kK9/jY9Z32cE0MoDt99mMmGwY+YqtvQODys SPLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=qY0iS8u2LJMGt4Ov6dPslL6Xeqan2bC0+CFm4DC6DyA=; b=DnShMr5g7dD/ktEAeZ8XsPSASlO1SwCtnMroC0mpRVL+e9nqGE3+JNxlISRq21aIZn 57m/C4LyPlYAVLy4aIrD5ZrzfT6IYZs1Le/AsPr5whKx8p0EEX3sO6TRm/G0OC2MpUEH 5/yc5kfmRR5YUw8JSCvfEwmZEcN4PIAfSxpyDtf7xxKrRySNwIqMH54U81gGVObRotlu ePe6GlIldjhtvwgF+kqwrp+c9fjagjXMbfnZ4vtPQuZzU0ESllCOseV3QO82s4yduRWf tLQSINu57r9JXvWrVDcQcYL5JQygjLqQNL+Ng/9ePLFQRm0vWRHivffqrT/5ksTnAV37 rUWg== X-Gm-Message-State: AOAM532b/RgrrMcARTX8uns7AfsH5z+oHALG4Yf5zhjiT1bMtrbFxa7G H+y6qxNLfBKXPNNhXyzmbOyOsw== X-Google-Smtp-Source: ABdhPJxjGy69LL16KLFnGDrccmNfv4kC6HTE9iSk9xrylNW7kFwzlsT1dNtB1k1IoUtYhI0fUFKcNQ== X-Received: by 2002:a17:903:1104:b0:13a:41f1:6dd5 with SMTP id n4-20020a170903110400b0013a41f16dd5mr5268285plh.48.1631808022058; Thu, 16 Sep 2021 09:00:22 -0700 (PDT) Received: from google.com (157.214.185.35.bc.googleusercontent.com. [35.185.214.157]) by smtp.gmail.com with ESMTPSA id cl16sm3300728pjb.23.2021.09.16.09.00.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 16 Sep 2021 09:00:21 -0700 (PDT) Date: Thu, 16 Sep 2021 16:00:17 +0000 From: Sean Christopherson To: Hou Wenlong Cc: kvm@vger.kernel.org, Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" , "H. Peter Anvin" , "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)" Subject: Re: [PATCH v2 3/3] kvm: x86: Emulate hypercall instead of fixing hypercall instruction Message-ID: References: <7893d13e11648be0326834248fcb943088fb0b76.1631188011.git.houwenlong93@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <7893d13e11648be0326834248fcb943088fb0b76.1631188011.git.houwenlong93@linux.alibaba.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org On Thu, Sep 09, 2021, Hou Wenlong wrote: > It is guest's resposibility to use right instruction for hypercall, > hypervisor could emulate wrong instruction instead of modifying > guest's instruction. > > Signed-off-by: Hou Wenlong > --- > @@ -8747,16 +8747,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); > > -static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt) > +static int emulator_hypercall(struct x86_emulate_ctxt *ctxt) > { > + int ret; > struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); > - char instruction[3]; > - unsigned long rip = kvm_rip_read(vcpu); > - > - static_call(kvm_x86_patch_hypercall)(vcpu, instruction); > > - return emulator_write_emulated(ctxt, rip, instruction, 3, > - &ctxt->exception); > + ret = kvm_emulate_hypercall_noskip(vcpu); I have mixed feelings on calling out of the emulator to do all this work. One on hand, I think it's a somewhat silly, arbitrary boundary. On the other hand, reading and writing GPRs directly means the emulation context holds stale data, especially in the case where the hypercall triggers an exit to userspace. > + if (ret) > + return X86EMUL_CONTINUE; > + return X86EMUL_IO_NEEDED; Unfortunately, simply returning X86EMUL_IO_NEEDED is not sufficient to handle the KVM_HC_MAP_GPA_RANGE case. x86_emulate_instruction() doesn't directly act on X86EMUL_IO_NEEDED, because x86_emulate_insn() doesn't return X86EMUL_*, it returns EMULATION_FAILED, EMULATION_OK, etc... x86_emulate_instruction() instead looks for other signals to differentiate between exception injection, PIO, MMIO, etc... The IO_NEEDED path would also need to provide an alternative complete_userspace_io callback, otherwise complete_hypercall_exit() will attempt to skip the instruction using e.g. vmcs.VM_EXIT_INSTRUCTION_LEN and likely send the guest into the weeds. In the prior patch, having kvm_emulate_hypercall_noskip() skip the CPL check is definitely wrong, and skipping Xen/Hyper-V hypercall support is odd, though arguably correct since Xen/Hyper-V hypercalls should never hit this path. All of the above are solvable problems, but there is a non-trivial cost in doing so, especially looking ahead to TDX support, which also needs/wants to split kvm_emulate_hypercall() but in slightly different ways[*]. I 100% agree that patching the hypercall instruction is awful. There are myriad fatal issues with the current approach: 1. Patches using an emulated guest write, which will fail if RIP is not mapped writable, and even injects a #PF into the guest on failure. 2. Doesn't ensure the write is "atomic", e.g. a hypercall that splits a page boundary will be handled as two separate writes, which means that a partial, corrupted instruction can be observed by a separate vCPU. 3. Doesn't serialize other CPU cores after updating the code stream. 4. Completely fails to account for the case where KVM is emulating due to invalid guest state with unrestricted_guest=0. Patching and retrying the instruction will result in vCPU getting stuck in an infinite loop. But, the "support" _so_ awful, especially #1, that there's practically zero chance that a modern guest kernel can rely on KVM to patch the guest. E.g. patching fails on modern Linux due to kernel code being mapped NX (barring admin override). This was addressed in the Linux guest back in 2014 by commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only"). In other words, odds are very good that only old Linux guest kernels rely on KVM's patching. Because of that, my preference is to keep the patching, do our best to make it suck a little less, and aim to completely remove the patching entirely at some point in the future. For #1, inject a #UD instead of #PF if the write fails. The guest will still likely die, but the failure signature is much friendlier to debuggers. For #2 and #3, do nothing as fixing those is non-trivial. For #4, inject a #UD if the hypercall instruction is already the "right" instruction. I.e. retroactively define KVM's ABI to be that KVM hypercalls have undefined behavior in Big Real Mode and other modes that trigger invalid guest state (since the hypercalls will still work if unrestricted_guest=1). This can't be an ABI breakage since it does not work today and can't possibly have ever worked in the past. I have mostly-complete to do the above (I ran afoul of the NX thing), I'll hopefully get them out next week. [*] https://lkml.kernel.org/r/9e1e66787c50232391e20cb4b3d1c5b249e3f910.1625186503.git.isaku.yamahata@intel.com