To: "Kirill A. Shutemov", Dave Hansen, Andy Lutomirski, Peter Zijlstra, Sean Christopherson, Jim Mattson
Cc: David Rientjes, "Edgecombe, Rick P", "Kleen, Andi", "Yamahata, Isaku", x86@kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov", Oscar Salvador, Naoya Horiguchi
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
Date: Wed, 7 Apr 2021 16:55:54 +0200
Message-ID: <5e934d94-414c-90de-c58e-34456e4ab1cf@redhat.com>
In-Reply-To: <20210402152645.26680-8-kirill.shutemov@linux.intel.com>
References: <20210402152645.26680-1-kirill.shutemov@linux.intel.com> <20210402152645.26680-8-kirill.shutemov@linux.intel.com>

On 02.04.21 17:26, Kirill A. Shutemov wrote:
> TDX architecture aims to provide resiliency against confidentiality and
> integrity attacks. Towards this goal, the TDX architecture helps enforce
> memory integrity for all TD-private memory.
>
> The CPU memory controller computes the integrity check value (MAC) for
> the data (cache line) during writes, and it stores the MAC with the
> memory as meta-data. A 28-bit MAC is stored in the ECC bits.
>
> Memory integrity is checked during memory reads. If the integrity check
> fails, the CPU poisons the cache line.
>
> On a subsequent consumption (read) of the poisoned data by software,
> there are two possible scenarios:
>
>  - The core determines that execution can continue, and it treats the
>    poison with exception semantics, signaled as an #MCE.
>
>  - The core determines that execution cannot continue, and it does an
>    unbreakable shutdown.
>
> For more details, see Chapter 14 of the Intel TDX Module EAS [1].
>
> As some integrity check failures may lead to a system shutdown, the host
> kernel must not allow any writes to TD-private memory. This requirement
> clashes with the KVM design: KVM expects the guest memory to be mapped
> into host userspace (e.g. QEMU).
>
> This patch aims to start a discussion on how we can approach the issue.
>
> For now I intentionally keep TDX out of the picture here and try to find
> a generic way to unmap KVM guest memory from host userspace. Hopefully,
> it makes the patch more approachable. And anyone can try it out.
>
> To the proposal:
>
> Looking into existing code paths, I've discovered that we already have
> the semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap
> entries in page tables:
>
>   - If an application touches a page mapped with SWP_HWPOISON, it will
>     get SIGBUS.
>
>   - GUP will fail with -EFAULT;
>
> Accessing the poisoned memory via the page cache doesn't match the
> required semantics right now, but it shouldn't be too hard to make it
> work: access to poisoned dirty pages should give -EIO or -EHWPOISON.
>
> My idea is that we can mark the page as poisoned when we make it
> TD-private and replace all PTEs that map the page with SWP_HWPOISON.
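If I read the proposal correctly, converting a page to TD-private essentially
reuses the memory-failure machinery. A very rough, untested sketch of what I
understand the conversion to look like (the function name is made up;
TestSetPageHWPoison(), try_to_unmap() and friends are the existing helpers;
locking details, hugepages and error handling are ignored):

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>

/* Made-up name: called when KVM converts a guest page to TD-private. */
static int td_protect_page(struct page *page)
{
	/* Pretend the page is hwpoisoned; also catches actually poisoned pages. */
	if (TestSetPageHWPoison(page))
		return -EBUSY;

	lock_page(page);
	/*
	 * With PageHWPoison set, the rmap code replaces every PTE mapping the
	 * page with a SWP_HWPOISON swap entry, so a later user-space access
	 * raises SIGBUS and GUP fails with -EFAULT.
	 */
	try_to_unmap(page, TTU_IGNORE_MLOCK);
	unlock_page(page);

	return page_mapped(page) ? -EBUSY : 0;
}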
It looks quite hacky (well, what did I expect from an RFC :) ): you can no
longer distinguish actually poisoned pages from "temporarily poisoned" pages.
FOLL_ALLOW_POISONED sounds especially nasty and dangerous - "I want to
read/write a poisoned page, trust me, I know what I am doing".

Storing the state for each individual page initially sounded like the right
thing to do, but I wonder if we couldn't handle this on a per-VMA level. You
can just remember the handful of shared ranges internally, like you do right
now AFAIU.

From what I get, you want a way to

1. Unmap pages from the user space page tables.

2. Disallow re-faulting of the protected pages into the page tables. On user
   space access, you want to deliver some signal (e.g., SIGBUS).

3. Allow selected users to still grab the pages (esp. KVM, to fault them into
   the page tables).

4. Allow access to currently shared specific pages from user space.

Right now, you achieve

1. via try_to_unmap()
2. via TestSetPageHWPoison()
3. TBD (e.g., FOLL_ALLOW_POISONED)
4. via ClearPageHWPoison()

If we could bounce all writes to shared pages through the kernel, things could
end up a little easier. Some very rough idea:

We could let user space set up VM memory with mprotect(PROT_READ)
(+ PROT_KERNEL_WRITE?), and after activating protected memory (I assume via a
KVM ioctl), make sure the VMAs cannot be set to PROT_WRITE anymore. This would
already properly unmap and deliver a SIGSEGV when user space tries to write.

You could then still access the pages, e.g., via FOLL_FORCE or a new fancy
flag that allows writing with VM_MAYWRITE|VM_DENYUSERWRITE. This would allow
an ioctl to write page content and to map the pages into NPTs; a rough sketch
of such a write path is below.

As an extension, we could think about (re?)mapping some shared pages
read|write. The question is how to synchronize with user space.

I have no idea how expensive bouncing writes (and reads?) through the kernel
would be. Did you ever experiment with that, or evaluate it?
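To make the "bounce writes through the kernel" idea slightly more concrete,
the kernel side could mostly be a wrapper around the existing
access_remote_vm() helper with FOLL_FORCE. Again only a very rough, untested
sketch: the ioctl and its payload are invented here, and FOLL_FORCE writes
only work on private (CoW) mappings, which is what QEMU typically uses for
guest RAM:

#include <linux/kvm_host.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical ioctl payload. */
struct kvm_write_guest_mem {
	__u64 uaddr;	/* address inside the PROT_READ guest-RAM VMA */
	__u64 len;	/* number of bytes to write */
	__u64 data;	/* user pointer to the bytes to write */
};

static long kvm_vm_ioctl_write_guest_mem(struct kvm *kvm,
					 struct kvm_write_guest_mem *arg)
{
	void *buf;
	long copied;

	if (!arg->len || arg->len > PAGE_SIZE)
		return -EINVAL;

	buf = memdup_user(u64_to_user_ptr(arg->data), arg->len);
	if (IS_ERR(buf))
		return PTR_ERR(buf);

	/*
	 * FOLL_FORCE allows the write although the VMA is not VM_WRITE;
	 * user space itself still gets SIGSEGV on a direct write.
	 */
	copied = access_remote_vm(kvm->mm, arg->uaddr, buf, arg->len,
				  FOLL_WRITE | FOLL_FORCE);
	kfree(buf);
	return copied == arg->len ? 0 : -EFAULT;
}

The user-space side would boil down to mmap(MAP_PRIVATE|MAP_ANONYMOUS) for
guest RAM, mprotect(PROT_READ) before the "activate protected memory" ioctl,
and KVM refusing any later mprotect(PROT_WRITE) on these VMAs.

-- 
Thanks,

David / dhildenb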