From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.4 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB812C433E0 for ; Fri, 15 May 2020 22:23:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AC8FC20643 for ; Fri, 15 May 2020 22:23:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="dl8PIz8P" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726721AbgEOWXk (ORCPT ); Fri, 15 May 2020 18:23:40 -0400 Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:31908 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726183AbgEOWXk (ORCPT ); Fri, 15 May 2020 18:23:40 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1589581417; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Mast8/ulH2Tav7VGd7dOuafB8vGOVhxtqzQHSIPHMcQ=; b=dl8PIz8Px5Okqb12ExkD8IjR1bsuYDjUHfUUlU1lD8SUWYEc+rA7o6i7AYyquPSlYN7qC8 QsndrOENb+uObkyHXx+wKWHDSpEl4/i4O5JzIeqa2QmBROxqVVdD5KhMqm3fMpt7QDsrCz Xxrz/5TwAztryVx6ZlTcjF0jsRSwzcM= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-236-bu68UmpCNUyUkw0EV-13LQ-1; Fri, 15 May 2020 18:23:36 -0400 X-MC-Unique: bu68UmpCNUyUkw0EV-13LQ-1 Received: by mail-wm1-f71.google.com with SMTP id w189so1814765wmg.1 for ; Fri, 15 May 2020 15:23:36 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=Mast8/ulH2Tav7VGd7dOuafB8vGOVhxtqzQHSIPHMcQ=; b=r/K6CfCqyp5/U2F1SFcgzHHVhyXeH9Rby2GDxbhAbfEUit43RjkaM6wCXMPhOXt8GA KzTRmmwuKZUQD18QhIh96IB5YprhliST/YnNo79L1hZ7xQmSU8IO7yaU5wlRhM8pTWds c3x3yrJQSnTjKD6M967+cBlA5hXcsRSgH3fHRotTK0m4akhtYjLJXBrfHKgMUBpo9hcR 1F/uSteorpSC7iB+z8H6dqqdxnNCtS8Vjoxa1zrI7QU6/GaEoojXWyA6QqFOW6xMmOsU B3lq2zeTdb0+0u5o5Ol8SeMETl3bKC6H6mUEVR/sfqfgA+h1xa8tiYgWh2HUeSygLCLk /OJQ== X-Gm-Message-State: AOAM533sZ/rRGYE1mvbH8GZLDwsPKIJgOTbqW+KaLbRaU8Y9+TB8/W+T i0b96dZ1dTr/zTCPXqsUB7DI3Hwtgs0FjEVeI86OkqCio1NOLOH/zeu9epuwJOkADcqmNicaLjy slheOS2pVbio6TUgCcKIBSvmH X-Received: by 2002:a5d:4ed0:: with SMTP id s16mr6940556wrv.166.1589581414872; Fri, 15 May 2020 15:23:34 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx0oiFzvaBdq6Kzo+VNSJDCPqwbyVSadNMC8i2dkTwza2qYxhs1/3DMwpFdYAvbePIbdnZwSw== X-Received: by 2002:a5d:4ed0:: with SMTP id s16mr6940529wrv.166.1589581414540; Fri, 15 May 2020 15:23:34 -0700 (PDT) Received: from [192.168.10.150] ([93.56.170.5]) by smtp.gmail.com with ESMTPSA id t14sm1604725wrs.1.2020.05.15.15.23.33 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 15 May 2020 15:23:33 -0700 (PDT) Subject: Re: [PATCH 2/8] KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info To: Sean Christopherson Cc: Vivek Goyal , Vitaly Kuznetsov , kvm@vger.kernel.org, x86@kernel.org, Andy Lutomirski , Thomas Gleixner , Borislav Petkov , "H. Peter Anvin" , Wanpeng Li , Jim Mattson , Gavin Shan , Peter Zijlstra , linux-kernel@vger.kernel.org, Andrew Cooper References: <20200511164752.2158645-1-vkuznets@redhat.com> <20200511164752.2158645-3-vkuznets@redhat.com> <20200512152709.GB138129@redhat.com> <87o8qtmaat.fsf@vitty.brq.redhat.com> <20200512155339.GD138129@redhat.com> <20200512175017.GC12100@linux.intel.com> <20200513125241.GA173965@redhat.com> <0733213c-9514-4b04-6356-cf1087edd9cf@redhat.com> <20200515184646.GD17572@linux.intel.com> <20200515204341.GF17572@linux.intel.com> From: Paolo Bonzini Message-ID: <943cfc2f-5b18-e00a-f5a2-4577472a1ff5@redhat.com> Date: Sat, 16 May 2020 00:23:31 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 MIME-Version: 1.0 In-Reply-To: <20200515204341.GF17572@linux.intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 15/05/20 22:43, Sean Christopherson wrote: > On Fri, May 15, 2020 at 09:18:07PM +0200, Paolo Bonzini wrote: >> On 15/05/20 20:46, Sean Christopherson wrote: >>> Why even bother using 'struct kvm_vcpu_pv_apf_data' for the #PF case? VMX >>> only requires error_code[31:16]==0 and SVM doesn't vet it at all, i.e. we >>> can (ab)use the error code to indicate an async #PF by setting it to an >>> impossible value, e.g. 0xaaaa (a is for async!). That partciular error code >>> is even enforced by the SDM, which states: >> >> Possibly, but it's water under the bridge now. > > Why is that? I thought we were redoing the entire thing because the current > ABI is unfixably broken? In other words, since the guest needs to change, > why are we keeping any of the current async #PF pieces? E.g. why keep using > #PF instead of usurping something like #NP? Because that would be 3 ABIs to support instead of 2. The #PF solution is only broken as long as you allow async PF from ring 0 (which wasn't even true except for preemptable kernels) _and_ have NMIs that can generate page faults. We also have the #PF vmexit part for nested virtualization. This adds up and makes a quick fix for 'page not ready' notifications not that quick. However, interrupts for 'page ready' do have a bunch of advantages (more control on what can be preempted by the notification, a saner check for new page faults which is effectively a bug fix) so it makes sense to get them in more quickly (probably 5.9 at this point due to the massive cleanups that are being done around interrupt vectors). >> And the #PF mechanism also has the problem with NMIs that happen before the >> error code is read and page faults happening in the handler. > > Hrm, I think there's no unfixable problem except for a pathological > #PF->NMI->#DB->#PF scenario. But it is a problem :-( Yeah, that made no sense. But still I'm not sure the x86 maintainers would like it. The only possible isue with #VE is the re-entrancy at the end. Andy proposed re-enabling it from an interrupt, but here is one solution that can be done almost entirely from C. The idea is to split the IST in two halves, and flip between them in the TSS with an XOR operation on entry to the interrupt handler. This is possible because there won't ever be more than two handlers active at the same time. Unlike if you used SUB/ADD, with XOR you don't have to restore the IST on exit: the two halves will take turns as the current IST and there's no problematic window between the ADD and the IRET. The pseudocode would be: - on #VE entry xor 512 with the IST address in the TSS check if saved RSP comes from the IST if so: overwrite the saved flags/CS/SS in the "other" IST half with the current value of the registers overwrite the saved RSP in the "other" IST half with the address of the top of the IST itself overwrite the saved RIP in the "other" IST half with the address of a trampoline (see below) else: save the top 5 words of the IST somewhere do normal stuff - the trampoline restores the 5 words at the top of the IST with five push instructions, and jumps back to the first instruction of the handler Everything except the first step can even be done in C. Here is an example. Assuming that on entry to the outer #VE the IST is the "even" half, the outer #VE moves the IST to the "odd" half and the return flags/CS/SS/RSP/RIP are saved. After the reentrancy flag has been cleared, a nested #VE arrives and runs within the "odd" half of the IST. The IST is moved back to the "even" half and the return flags/CS/SS/RSP/RIP in the "even" half are patched to point to the trampoline. When we get back to the outer handler the reentrancy flag not zero, so even though the IST points to the current stack, reentrancy is impossible and we go just fine through the few final instructions of the handler. On outer #VE exit, the IRET instruction jumps to the trampoline, with RSP pointing at the top of the "even" half. The return flags/CS/SS/RSP/RIP are restored, and everything restarts from the beginning: the outer #VE moves the IST to the "odd" half, the return flags/CS/SS/RSP/RIP are saved, the data for the nested #VE is fished out of the virtualization exception area and so on. Paolo