From: Dave Hansen
Date: Mon, 9 May 2022 15:56:17 -0700
Subject: Re: [PATCH v8 0/8] x86: Show in sysfs if a memory node is able to do encryption
To: Borislav Petkov
Cc: Dan Williams, Martin Fernandez, Linux Kernel Mailing List,
 linux-efi, platform-driver-x86@vger.kernel.org, Linux MM,
 "H. Peter Anvin", daniel.gutson@eclypsium.com, Darren Hart,
 Andy Shevchenko, Kees Cook, Andrew Morton, Ard Biesheuvel,
 Ingo Molnar, Thomas Gleixner, Dave Hansen, "Rafael J. Wysocki",
 X86 ML, "Schofield, Alison", hughsient@gmail.com,
 alex.bazhaniuk@eclypsium.com, Greg KH, Mike Rapoport, Ben Widawsky,
 "Huang, Kai", Sean Christopherson, "Shutemov, Kirill",
 Kuppuswamy Sathyanarayanan, Tom Lendacky, Michael Roth
Message-ID: <71c0e2b4-1a58-62ad-b8af-9e00fdd1222d@intel.com>
References: <6d90c832-af4a-7ed6-4f72-dae08bb69c37@intel.com>
 <47140A56-D3F8-4292-B355-5F92E3BA9F67@alien8.de>
 <6abea873-52a2-f506-b21b-4b567bee1874@intel.com>
 <4bc56567-e2ce-40ec-19ab-349c8de8d969@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 5/9/22 15:17, Borislav Petkov wrote:
>
>> This new ABI provides a way to avoid that situation in the first place.
>> Userspace can look at sysfs to figure out which NUMA nodes support
>> "encryption" (aka. TDX) and can use the existing NUMA policy ABI to
>> avoid TDH.MEM.PAGE.ADD failures.
>>
>> So, here's the question for the TDX folks: are these mixed-capability
>> systems a problem for you? Does this ABI help you fix the problem?
> What I'm not really sure too is, is per-node granularity ok? I guess it
> is but let me ask it anyway...

I think nodes are the only sane granularity.

tl;dr: Zones might work in theory but have no existing useful ABI around
them and too many practical problems.
Nodes are the only other real option without inventing something new
and fancy.

--

What about zones (or any sub-node granularity really)?

Folks have, for instance, discussed adding new memory zones for this
purpose: have ZONE_NORMAL, and then ZONE_UNENCRYPTABLE (or something
similar). Zones are great because they have their own memory
allocation pools and can be targeted directly from within the kernel
using things like GFP_DMA. If you run out of ZONE_FOO, you can
theoretically just reclaim ZONE_FOO.

But, even a single new zone isn't necessarily good enough. What if we
have some ZONE_NORMAL that's encryption-capable and some that's not?
The same goes for ZONE_MOVABLE. We'd probably need at least:

    ZONE_NORMAL
    ZONE_NORMAL_UNENCRYPTABLE
    ZONE_MOVABLE
    ZONE_MOVABLE_UNENCRYPTABLE

Also, zones are (mostly) not exposed to userspace. If we want
userspace to be able to specify encryption capabilities, we're talking
about new ABI for enumeration and policy specification.

Why node granularity?

First, for the majority of cases, nodes "just work". ACPI systems with
an "HMAT" table already separate out different performance classes of
memory into different Proximity Domains (PXMs) which the kernel maps
into NUMA nodes.

This means that for NVDIMMs or virtually any CXL memory regions (one or
more CXL devices glued together) we can think of, they already get
their own NUMA node. Those nodes have their own zones (implicitly) and
can lean on the existing NUMA ABI for enumeration and policy creation.

Basically, the firmware creates the NUMA nodes for the kernel. All the
kernel has to do is acknowledge which of them can do encryption or not.

The one place where nodes fall down is if a memory hot-add occurs
within an existing node and the newly hot-added memory does not match
the encryption capabilities of the existing memory.
The kernel basically has two options in that case:

 * Throw away the memory until the next reboot where the system might
   be reconfigured in a way to support more uniform capabilities (this
   is actually *likely* for a reboot of a TDX system)
 * Create a synthetic NUMA node to hold it

Neither one of those is a horrible option. Throwing the memory away is
the most likely way TDX will handle this situation if it pops up. For
now, the folks building TDX-capable BIOSes claim emphatically that such
a system won't be built.
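
To make the userspace side of the node-granularity argument concrete,
here is a rough sketch of "look at sysfs, then feed the result to the
existing NUMA policy ABI". It is illustrative only. The per-node
attribute name "crypto_capable" and the helper names
crypto_capable_nodes() and nodelist() are assumptions made for the
sketch, not something this mail specifies.

```python
import os
import re


def crypto_capable_nodes(node_root="/sys/devices/system/node"):
    """Return sorted IDs of NUMA nodes that report encryption capability.

    ASSUMPTION: each nodeN directory carries a 'crypto_capable' file
    containing 0 or 1. The attribute name is hypothetical here.
    """
    capable = []
    for entry in os.listdir(node_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip non-node entries such as 'online' or 'possible'
        attr = os.path.join(node_root, entry, "crypto_capable")
        try:
            with open(attr) as f:
                value = f.read().strip()
        except OSError:
            # Attribute absent or unreadable: capability unknown, skip node.
            continue
        if value == "1":
            capable.append(int(match.group(1)))
    return sorted(capable)


def nodelist(nodes):
    """Format node IDs as a comma-separated node list string."""
    return ",".join(str(n) for n in nodes)
```

The resulting string is the kind of node list that existing tooling
such as "numactl --membind=" accepts, so a launcher could bind a
workload to encryption-capable nodes without any new policy ABI.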