From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E57D3C433F5 for ; Thu, 10 Mar 2022 00:51:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238879AbiCJAwe (ORCPT ); Wed, 9 Mar 2022 19:52:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44546 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234466AbiCJAwd (ORCPT ); Wed, 9 Mar 2022 19:52:33 -0500 Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8D25A6E56B for ; Wed, 9 Mar 2022 16:51:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646873493; x=1678409493; h=date:from:to:cc:subject:message-id:references: mime-version:in-reply-to; bh=lkeupvj4kynTUScTDJhfQz76mAOO5ubgDQN4+bqAFrA=; b=jMHOnKYQi2RrjC34divGYtrvX/ZFvgoK4u64V+GJoE6YHUo4PkpxfpTZ 3f8DltYDBka2bWPkjnkv0XDxuorTqxTPS5gvijlVA/lxPTmu/6unHPeWQ Txct4qZDi1qJ4FjGVoF1dqkj+rAIA0utWsSoS/7kkalgn61Djeq564qAp nx1zNtxwt85Wc2y6JfxiAHd3T/h0FiDpMWA/CQFhYACn7gOiW+bRO1R81 60SB1QX5cP/kJIRbfu5ZYxdkxr6V7ehlkA6/wknq9ZvFGFUv99VYlFwIr En+L0cMiBHYLTP8QjKUlGAVT/p+FBSPKEPrcaJXtmvjkmXYuzl/Yd6T8F w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="255313771" X-IronPort-AV: E=Sophos;i="5.90,169,1643702400"; d="scan'208";a="255313771" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Mar 2022 16:51:33 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,169,1643702400"; d="scan'208";a="496053296" Received: from black.fi.intel.com ([10.237.72.28]) by orsmga003.jf.intel.com with ESMTP; 09 Mar 2022 16:51:26 -0800 Received: by black.fi.intel.com (Postfix, from userid 1000) id 70C69211; Thu, 10 Mar 2022 02:51:45 +0200 (EET) Date: Thu, 10 Mar 2022 03:51:45 +0300 From: "Kirill A. Shutemov" To: Dave Hansen Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, luto@kernel.org, peterz@infradead.org, sathyanarayanan.kuppuswamy@linux.intel.com, aarcange@redhat.com, ak@linux.intel.com, dan.j.williams@intel.com, david@redhat.com, hpa@zytor.com, jgross@suse.com, jmattson@google.com, joro@8bytes.org, jpoimboe@redhat.com, knsathya@kernel.org, pbonzini@redhat.com, sdeep@vmware.com, seanjc@google.com, tony.luck@intel.com, vkuznets@redhat.com, wanpengli@tencent.com, thomas.lendacky@amd.com, brijesh.singh@amd.com, x86@kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCHv5 11/30] x86/tdx: Handle in-kernel MMIO Message-ID: <20220310005145.hzv2lzxgs7uxblfr@black.fi.intel.com> References: <20220302142806.51844-1-kirill.shutemov@linux.intel.com> <20220302142806.51844-12-kirill.shutemov@linux.intel.com> <81a7ad6d-6bd9-7674-3229-67a5cd2e485a@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <81a7ad6d-6bd9-7674-3229-67a5cd2e485a@intel.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 08, 2022 at 01:26:28PM -0800, Dave Hansen wrote: > On 3/2/22 06:27, Kirill A. Shutemov wrote: > > In non-TDX VMs, MMIO is implemented by providing the guest a mapping > > which will cause a VMEXIT on access and then the VMM emulating the > > instruction that caused the VMEXIT. That's not possible for TDX VM. > > > > To emulate an instruction an emulator needs two things: > > > > - R/W access to the register file to read/modify instruction arguments > > and see RIP of the faulted instruction. > > > > - Read access to memory where instruction is placed to see what to > > emulate. In this case it is guest kernel text. > > > > Both of them are not available to VMM in TDX environment: > > > > - Register file is never exposed to VMM. When a TD exits to the module, > > it saves registers into the state-save area allocated for that TD. > > The module then scrubs these registers before returning execution > > control to the VMM, to help prevent leakage of TD state. > > > > - Memory is encrypted a TD-private key. The CPU disallows software > > other than the TDX module and TDs from making memory accesses using > > the private key. > > Memory encryption has zero to do with this. The TDX isolation > mechanisms are totally discrete from memory encryption, although they > are "neighbors" of sorts. Hm. I don't see why you say encryption is not relevant. VMM (host kernel) has ultimate access to guest memory cypher text. It can read it as cypher text without any issue (using KeyID-0). Could you elaborate on the point? > > == Patching TDX drivers == > > > > Rather than touching the entire kernel, it might also be possible to > > just go after drivers that use MMIO in TDX guests. Right now, that's > > limited only to virtio and some x86-specific drivers. > > > > All virtio MMIO appears to be done through a single function, which > > makes virtio eminently easy to patch. > > > > This approach will be adopted in the future, removing the bulk of > > MMIO #VEs. The #VE-based MMIO will remain serving non-virtio use cases. > > This still doesn't *quite* do it for me for a justification. Why can't > the non-virtio cases be converted as well? Why doesn't the "patching > MMIO sites" work for x86 code too? > > You really need to convince us that *this* approach will be required > forever. What if I add: Many drivers can potentially be used inside TDX guest (e.g. via device passthough or random device emulation by VMM), but very few will. Patching every possible driver is not practical. #VE-based MMIO provides functionality for everybody. Performance-critical cases can be optimized as needed. ? > > +static bool handle_mmio(struct pt_regs *regs, struct ve_info *ve) > > +{ > > + char buffer[MAX_INSN_SIZE]; > > + unsigned long *reg, val; > > + struct insn insn = {}; > > + enum mmio_type mmio; > > + int size, extend_size; > > + u8 extend_val = 0; > > + > > + if (copy_from_kernel_nofault(buffer, (void *)regs->ip, MAX_INSN_SIZE)) > > + return false; > > + > > + if (insn_decode(&insn, buffer, MAX_INSN_SIZE, INSN_MODE_64)) > > + return false; > > + > > + mmio = insn_decode_mmio(&insn, &size); > > + if (WARN_ON_ONCE(mmio == MMIO_DECODE_FAILED)) > > + return false; > > + > > + if (mmio != MMIO_WRITE_IMM && mmio != MMIO_MOVS) { > > + reg = insn_get_modrm_reg_ptr(&insn, regs); > > + if (!reg) > > + return false; > > + } > > + > > + ve->instr_len = insn.length; > > + > > + switch (mmio) { > > + case MMIO_WRITE: > > + memcpy(&val, reg, size); > > + return mmio_write(size, ve->gpa, val); > > + case MMIO_WRITE_IMM: > > + val = insn.immediate.value; > > + return mmio_write(size, ve->gpa, val); > > + case MMIO_READ: > > + case MMIO_READ_ZERO_EXTEND: > > + case MMIO_READ_SIGN_EXTEND: > > + break; > > + case MMIO_MOVS: > > + case MMIO_DECODE_FAILED: > > + /* > > + * MMIO was accessed with an instruction that could not be > > + * decoded or handled properly. It was likely not using io.h > > + * helpers or accessed MMIO accidentally. > > + */ > > + return false; > > + default: > > + /* Unknown insn_decode_mmio() decode value? */ > > + BUG(); > > + } > > BUG()s are bad. The set of insn_decode_mmio() return codes is known at > compile time. If we're really on the lookout for unknown values, why > not just: > > BUILD_BUG_ON(NR_MMIO_TYPES != 6); // or whatever This doesn't work. We can pretend that the function only forced to return values from the enum. The truth is that it can return whatever int it wants. Type system in C is too week to guarantee anything here. The BUG() is backstop for it. This BUILD_BUG_ON() is useless. Compiler complains about missing case in the switch anyway. > Also, there are *lots* of ways for this function to just fall over and > fail. Why does this particular failure mode deserve a BUG()? > > Is there a reason a BUG() is better than returning failure which > presumably sets off the #GP-like logic? BUG() here makes it clear that the handler itself is buggy. Returning false and kicking in #GP-like logic indicates that something wrong with the code that triggered #VE. I think it is an important distinction. > Also, now that I've read this a few times, I've been confused by the > same thing a few times. This is handling instructions that might read > or write or do both, correct? > > Should that be made explicit in a function comment? Hm. Okay. Something like /* Handle reads from and writes to MMIO region. */ before the function? -- Kirill A. Shutemov