From: Erdem Aktas
Date: Mon, 7 Nov 2022 14:53:37 -0800
Subject: Re: [PATCH 2/2] x86/tdx: Do not allow #VE due to EPT violation on the private memory
To: Dave Hansen
Cc: "Nakajima, Jun", Guorui Yu, kirill.shutemov@linux.intel.com, ak@linux.intel.com,
 bp@alien8.de, dan.j.williams@intel.com, david@redhat.com, elena.reshetova@intel.com,
 hpa@zytor.com, linux-kernel@vger.kernel.org, luto@kernel.org, mingo@redhat.com,
 peterz@infradead.org, sathyanarayanan.kuppuswamy@linux.intel.com, seanjc@google.com,
 tglx@linutronix.de, thomas.lendacky@amd.com, x86@kernel.org
References: <20221028141220.29217-3-kirill.shutemov@linux.intel.com> <4bfcd256-b926-9b1c-601c-efcff0d16605@intel.com>

On Fri, Nov 4, 2022 at 3:50 PM Dave Hansen wrote:
>
> On 11/4/22 15:36, Erdem Aktas wrote:
> > On Fri, Oct 28, 2022 at 7:12 AM Kirill A. Shutemov wrote:
> >> + *
> >> + * Kernel has no legitimate use-cases for #VE on private memory. It is
> >> + * either a guest kernel bug (like access of unaccepted memory) or
> >> + * malicious/buggy VMM that removes guest page that is still in use.
> >> + *
> >
> > I think this statement is too strong, and I have a few concerns about
> > this approach. I understand that there is an issue with handling #VEs
> > on private pages, but it seems like we are just hiding the problem with
> > this approach instead of fixing it (I do not have a fix in mind either).
> > First, there is a feature for injecting #VE to handle unaccepted pages
> > at runtime and accept them on demand; now the statement says this was
> > an unnecessary feature (why is it there at all then?), as there is no
> > legitimate use case.
>
> We're doing on-demand page acceptance. We just don't need a #VE to
> drive it. Why is it in the TDX module then? Inertia? Because it got
> too far along in the process before anyone asked me or some of the other
> x86 kernel folks to look at it hard.
>
> > I wonder if this will limit how we can implement lazy TDACCEPT.
> > There are multiple ideas floating around now.
> > https://github.com/intel/tdx/commit/9b3ef9655b695d3c67a557ec016487fded8b0e2b
> > has 3 implementation choices, of which the "Accept a block of memory on
> > the first use." option is implemented. Actually, it says "Accept a block
> > of memory on the first use." but it is implemented as "Accept a block
> > of memory on the first allocation". The comments in this code also
> > raise concerns about performance.
> >
> > As of now, we do not know which one of those ideas will provide
> > acceptable performance for booting large VMs. If the performance
> > overhead is high, we can always implement lazy TDACCEPT so that the
> > first time a guest accesses unaccepted memory, the #VE handler does
> > the TDACCEPT.
>
> Could you please elaborate a bit on what you think the distinction is
> between:
>
> 	* Accept on first use
> and
> 	* Accept on allocation
>
> Surely, for the vast majority of memory, it's allocated and then used
> pretty quickly. As in, most allocations are __GFP_ZERO so they're
> allocated and "used" before they even leave the allocator. So, in
> practice, they're *VERY* close to equivalent.
>
> Where do you see them diverging? Why does it matter?

For a VM with a very large memory size, let's say close to 800G, it might
take a really long time to finish initialization. If all allocations were
__GFP_ZERO, then I agree it would not matter, but -- I need to run some
benchmarks to validate -- what I remember is that this was not what we
were observing. Let me run a few tests to provide more input on this, but
meanwhile, if you have already run some benchmarks, that would be great.

What I see in the code is that the accept_page() function will zero all
the unaccepted pages even if the __GFP_ZERO flag is not set, and if
__GFP_ZERO is set, we will zero all those pages again. I also see a lot
of concerning comments like "Page acceptance can be very slow."

What I mean by "Accept on first use" is leaving the memory allocation
path as it is and using the #VE handler to accept pages the first time
they are accessed. Let me come back with some numbers on this, which
might take some time to collect.
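To make the comparison concrete, here is the kind of thing I have in mind,
as a rough sketch only: td_accept_range() and range_is_unaccepted() are
placeholder names made up for this example, not the helpers in the actual
unaccepted-memory patches.

/*
 * Illustrative sketch only -- not the actual patches.  td_accept_range()
 * stands in for "ask the TDX module to accept this GPA range" and
 * range_is_unaccepted() for "is any of this range still unaccepted".
 */

/* "Accept on allocation": accept in the allocator, before pages are handed out. */
static void accept_on_allocation(struct page *page, unsigned int order)
{
	phys_addr_t start = page_to_phys(page);
	phys_addr_t end = start + (PAGE_SIZE << order);

	/*
	 * Acceptance (and the zeroing it implies) is paid up front for every
	 * allocation, even for pages that are never touched, but no
	 * EPT-violation #VE can be triggered on private memory later.
	 */
	if (range_is_unaccepted(start, end))
		td_accept_range(start, end);
}

/* "Accept on first use": let the access fault and accept in the #VE handler. */
static bool accept_on_first_use(struct ve_info *ve)
{
	phys_addr_t gpa = ve->gpa & PAGE_MASK;

	if (ve->exit_reason != EXIT_REASON_EPT_VIOLATION)
		return false;

	/*
	 * Acceptance cost is deferred until a page is actually touched, but
	 * the handler must now be safe in every context that can touch
	 * unaccepted memory (per-cpu data, NMI stack, ...).
	 */
	td_accept_range(gpa, gpa + PAGE_SIZE);
	return true;
}

In both variants the acceptance call itself is the same; the difference is
only where the latency lands (allocation time vs. first touch) and which
contexts the accepting code has to be safe in.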
> > I am not trying to solve the lazy TDACCEPT problem here, but all I am
> > trying to say is that there might be legitimate use cases for #VE on
> > private memory, and this patch limits any future improvement we might
> > need to make to the lazy TDACCEPT implementation.
>
> The kernel can't take exceptions on arbitrary memory accesses. I have
> *ZERO* idea how to handle page acceptance on an access to a per-cpu
> variable referenced in syscall entry, or the NMI stack when we've
> interrupted kernel code with a user GSBASE value.
>
> So, we either find *ALL* the kernel memory that needs to be pre-accepted
> at allocation time (like kernel stacks) or we just say that all
> allocated memory needs to be accepted before we let it be allocated.
>
> One of those is really easy. The other means a boatload of code audits.
> I know. I had to do that kind of exercise to get KPTI to work. I
> don't want to do it again. It was worth it for KPTI when the world was
> on fire. TDX isn't that important IMNHO. There's an easier way.
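For reference, my reading of the guard this patch adds to the #VE path is
roughly the following; is_private_gpa() and the panic text here are a
paraphrase of the patch, not a verbatim quote.

/*
 * Paraphrased sketch of the guard under discussion, not the patch itself.
 * In a TD, "private" means the shared GPA bit is clear, which is what
 * cc_mkenc() expresses.
 */
static inline bool is_private_gpa(u64 gpa)
{
	return gpa == cc_mkenc(gpa);
}

static int ve_handle_ept_violation(struct pt_regs *regs, struct ve_info *ve)
{
	/*
	 * The kernel expects SEPT_VE_DISABLE to be set, so an EPT-violation
	 * #VE on private memory should be impossible.  If one arrives
	 * anyway, treat it as a guest bug or a misbehaving VMM rather than
	 * trying to recover.
	 */
	if (is_private_gpa(ve->gpa))
		panic("Unexpected #VE on private memory\n");

	/* Shared GPAs keep going through the normal MMIO emulation path. */
	return handle_mmio(regs, ve);
}

If lazy TDACCEPT ever does need the #VE path, this is the check that would
have to learn to tell "unaccepted private page" apart from "page removed by
the VMM".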