All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Isaku Yamahata <isaku.yamahata@gmail.com>,
	Chao Peng <chao.p.peng@linux.intel.com>,
	Hugh Dickins <hughd@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
	Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Sean Christopherson <seanjc@google.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Jeff Layton <jlayton@kernel.org>,
	"J . Bruce Fields" <bfields@fieldses.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>,
	Steven Price <steven.price@arm.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	Vlastimil Babka <vbabka@suse.cz>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
	ak@linux.intel.com, david@redhat.com, aarcange@redhat.com,
	ddutile@redhat.com, dhildenb@redhat.com,
	Quentin Perret <qperret@google.com>,
	tabba@google.com, Michael Roth <michael.roth@amd.com>,
	mhocko@suse.com, Muchun Song <songmuchun@bytedance.com>,
	wei.w.wang@intel.com
Subject: Re: [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM
Date: Tue, 15 Nov 2022 17:36:54 +0300	[thread overview]
Message-ID: <20221115143654.rqpf72hzdtrd3xyw@box.shutemov.name> (raw)
In-Reply-To: <20221109155404.istawiyvwr3yffag@box.shutemov.name>

On Wed, Nov 09, 2022 at 06:54:04PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 07, 2022 at 04:41:41PM -0800, Isaku Yamahata wrote:
> > On Thu, Nov 03, 2022 at 05:43:52PM +0530,
> > Vishal Annapurve <vannapurve@google.com> wrote:
> > 
> > > On Tue, Oct 25, 2022 at 8:48 PM Chao Peng <chao.p.peng@linux.intel.com> wrote:
> > > >
> > > > This patch series implements KVM guest private memory for confidential
> > > > computing scenarios like Intel TDX[1]. If a TDX host accesses
> > > > TDX-protected guest memory, machine check can happen which can further
> > > > crash the running host system, this is terrible for multi-tenant
> > > > configurations. The host accesses include those from KVM userspace like
> > > > QEMU. This series addresses KVM userspace induced crash by introducing
> > > > new mm and KVM interfaces so KVM userspace can still manage guest memory
> > > > via a fd-based approach, but it can never access the guest memory
> > > > content.
> > > >
> > > > The patch series touches both core mm and KVM code. I appreciate
> > > > Andrew/Hugh and Paolo/Sean can review and pick these patches. Any other
> > > > reviews are always welcome.
> > > >   - 01: mm change, target for mm tree
> > > >   - 02-08: KVM change, target for KVM tree
> > > >
> > > > Given KVM is the only current user for the mm part, I have chatted with
> > > > Paolo and he is OK to merge the mm change through KVM tree, but
> > > > reviewed-by/acked-by is still expected from the mm people.
> > > >
> > > > The patches have been verified in Intel TDX environment, but Vishal has
> > > > done an excellent work on the selftests[4] which are dedicated for this
> > > > series, making it possible to test this series without innovative
> > > > hardware and fancy steps of building a VM environment. See Test section
> > > > below for more info.
> > > >
> > > >
> > > > Introduction
> > > > ============
> > > > KVM userspace being able to crash the host is horrible. Under current
> > > > KVM architecture, all guest memory is inherently accessible from KVM
> > > > userspace and is exposed to the mentioned crash issue. The goal of this
> > > > series is to provide a solution to align mm and KVM, on a userspace
> > > > inaccessible approach of exposing guest memory.
> > > >
> > > > Normally, KVM populates secondary page table (e.g. EPT) by using a host
> > > > virtual address (hva) from core mm page table (e.g. x86 userspace page
> > > > table). This requires guest memory being mmaped into KVM userspace, but
> > > > this is also the source where the mentioned crash issue can happen. In
> > > > theory, apart from those 'shared' memory for device emulation etc, guest
> > > > memory doesn't have to be mmaped into KVM userspace.
> > > >
> > > > This series introduces fd-based guest memory which will not be mmaped
> > > > into KVM userspace. KVM populates secondary page table by using a
> > > 
> > > With no mappings in place for userspace VMM, IIUC, looks like the host
> > > kernel will not be able to find the culprit userspace process in case
> > > of Machine check error on guest private memory. As implemented in
> > > hwpoison_user_mappings, host kernel tries to look at the processes
> > > which have mapped the pfns with hardware error.
> > > 
> > > Is there a modification needed in mce handling logic of the host
> > > kernel to immediately send a signal to the vcpu thread accessing
> > > faulting pfn backing guest private memory?
> > 
> > mce_register_decode_chain() can be used.  MCE physical address(p->mce_addr)
> > includes host key id in addition to real physical address.  By searching used
> > hkid by KVM, we can determine if the page is assigned to guest TD or not. If
> > yes, send SIGBUS.
> > 
> > kvm_machine_check() can be enhanced for KVM specific use.  This is before
> > memory_failure() is called, though.
> > 
> > any other ideas?
> 
> That's too KVM-centric. It will not work for other possible user of
> restricted memfd.
> 
> I tried to find a way to get it right: we need to get restricted memfd
> code info about corrupted page so it can invalidate its users. On the next
> request of the page the user will see an error. In case of KVM, the error
> will likely escalate to SIGBUS.
> 
> The problem is that core-mm code that handles memory failure knows nothing
> about restricted memfd. It only sees that the page belongs to a normal
> memfd.
> 
> AFAICS, there's no way to get it intercepted from the shim level. shmem
> code has to be patches. shmem_error_remove_page() has to call into
> restricted memfd code.
> 
> Hugh, are you okay with this? Or maybe you have a better idea?

Okay, here is what I've come up with. It doesn't touch shmem code, but
hooks up directly into memory-failure.c. It is still ugly, but should be
tolerable.

restrictedmem_error_page() loops over all restrictedmem inodes. It is
slow, but memory failure is not hot path (I hope).

Only build-tested. Chao, could you hook up ->error for KVM and get it
tested?

diff --git a/include/linux/restrictedmem.h b/include/linux/restrictedmem.h
index 9c37c3ea3180..c2700c5daa43 100644
--- a/include/linux/restrictedmem.h
+++ b/include/linux/restrictedmem.h
@@ -12,6 +12,8 @@ struct restrictedmem_notifier_ops {
 				 pgoff_t start, pgoff_t end);
 	void (*invalidate_end)(struct restrictedmem_notifier *notifier,
 			       pgoff_t start, pgoff_t end);
+	void (*error)(struct restrictedmem_notifier *notifier,
+			       pgoff_t start, pgoff_t end);
 };
 
 struct restrictedmem_notifier {
@@ -34,6 +36,8 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return file->f_inode->i_sb->s_magic == RESTRICTEDMEM_MAGIC;
 }
 
+void restrictedmem_error_page(struct page *page, struct address_space *mapping);
+
 #else
 
 static inline void restrictedmem_register_notifier(struct file *file,
@@ -57,6 +61,11 @@ static inline bool file_is_restrictedmem(struct file *file)
 	return false;
 }
 
+static inline void restrictedmem_error_page(struct page *page,
+					    struct address_space *mapping)
+{
+}
+
 #endif /* CONFIG_RESTRICTEDMEM */
 
 #endif /* _LINUX_RESTRICTEDMEM_H */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e7ac570dda75..ee85e46c6992 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -62,6 +62,7 @@
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
+#include <linux/restrictedmem.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -939,6 +940,8 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
+	restrictedmem_error_page(p, mapping);
+
 	/*
 	 * The shmem page is kept in page cache instead of truncating
 	 * so is expected to have an extra refcount after error-handling.
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index e5bf8907e0f8..0dcdff0d8055 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -29,6 +29,18 @@ static void restrictedmem_notifier_invalidate(struct restrictedmem_data *data,
 	mutex_unlock(&data->lock);
 }
 
+static void restrictedmem_notifier_error(struct restrictedmem_data *data,
+				 pgoff_t start, pgoff_t end)
+{
+	struct restrictedmem_notifier *notifier;
+
+	mutex_lock(&data->lock);
+	list_for_each_entry(notifier, &data->notifiers, list) {
+			notifier->ops->error(notifier, start, end);
+	}
+	mutex_unlock(&data->lock);
+}
+
 static int restrictedmem_release(struct inode *inode, struct file *file)
 {
 	struct restrictedmem_data *data = inode->i_mapping->private_data;
@@ -248,3 +260,30 @@ int restrictedmem_get_page(struct file *file, pgoff_t offset,
 	return 0;
 }
 EXPORT_SYMBOL_GPL(restrictedmem_get_page);
+
+void restrictedmem_error_page(struct page *page, struct address_space *mapping)
+{
+	struct super_block *sb = restrictedmem_mnt->mnt_sb;
+	struct inode *inode, *next;
+
+	if (!shmem_mapping(mapping))
+		return;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+		struct restrictedmem_data *data = inode->i_mapping->private_data;
+		struct file *memfd = data->memfd;
+
+		if (memfd->f_mapping == mapping) {
+			pgoff_t start, end;
+
+			spin_unlock(&sb->s_inode_list_lock);
+
+			start = page->index;
+			end = start + thp_nr_pages(page);
+			restrictedmem_notifier_error(data, start, end);
+			return;
+		}
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+}
-- 
  Kiryl Shutsemau / Kirill A. Shutemov

  reply	other threads:[~2022-11-15 14:37 UTC|newest]

Thread overview: 101+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-25 15:13 [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Chao Peng
2022-10-25 15:13 ` [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory Chao Peng
2022-10-26 17:31   ` Isaku Yamahata
2022-10-28  6:12     ` Chao Peng
2022-10-27 10:20   ` Fuad Tabba
2022-10-31 17:47   ` Michael Roth
2022-11-01 11:37     ` Chao Peng
2022-11-01 15:19       ` Michael Roth
2022-11-01 19:30         ` Michael Roth
2022-11-02 14:53           ` Chao Peng
2022-11-02 21:19             ` Michael Roth
2022-11-14 14:02         ` Vlastimil Babka
2022-11-14 15:28           ` Kirill A. Shutemov
2022-11-14 22:16             ` Michael Roth
2022-11-15  9:48               ` Chao Peng
2022-11-14 22:16           ` Michael Roth
2022-11-02 21:14     ` Kirill A. Shutemov
2022-11-02 21:26       ` Michael Roth
2022-11-02 22:07       ` Michael Roth
2022-11-03 16:30         ` Kirill A. Shutemov
2022-11-29  0:06   ` Michael Roth
2022-11-29 11:21     ` Kirill A. Shutemov
2022-11-29 11:39       ` David Hildenbrand
2022-11-29 13:59         ` Chao Peng
2022-11-29 13:58       ` Chao Peng
2022-11-29  0:37   ` Michael Roth
2022-11-29 14:06     ` Chao Peng
2022-11-29 19:06       ` Michael Roth
2022-11-29 19:18         ` Michael Roth
2022-11-30  9:39           ` Chao Peng
2022-11-30 14:31             ` Michael Roth
2022-11-29 18:01     ` Vishal Annapurve
2022-12-02  2:16   ` Vishal Annapurve
2022-12-02  6:49     ` Chao Peng
2022-12-02 13:44       ` Kirill A . Shutemov
2022-10-25 15:13 ` [PATCH v9 2/8] KVM: Extend the memslot to support fd-based private memory Chao Peng
2022-10-27 10:25   ` Fuad Tabba
2022-10-28  7:04   ` Xiaoyao Li
2022-10-31 14:14     ` Chao Peng
2022-11-14 16:04   ` Alex Bennée
2022-11-15  9:29     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 3/8] KVM: Add KVM_EXIT_MEMORY_FAULT exit Chao Peng
2022-10-25 15:26   ` Peter Maydell
2022-10-25 16:17     ` Sean Christopherson
2022-10-27 10:27   ` Fuad Tabba
2022-10-28  6:14     ` Chao Peng
2022-11-15 16:56   ` Alex Bennée
2022-11-16  3:14     ` Chao Peng
2022-11-16 19:03       ` Alex Bennée
2022-11-17 13:45         ` Chao Peng
2022-11-17 15:08           ` Alex Bennée
2022-11-18  1:32             ` Chao Peng
2022-11-18 13:23               ` Alex Bennée
2022-11-18 15:59                 ` Sean Christopherson
2022-11-22  9:50                   ` Chao Peng
2022-11-23 18:02                     ` Sean Christopherson
2022-11-16 18:15   ` Andy Lutomirski
2022-11-16 18:48     ` Sean Christopherson
2022-11-17 13:42       ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 4/8] KVM: Use gfn instead of hva for mmu_notifier_retry Chao Peng
2022-10-27 10:29   ` Fuad Tabba
2022-11-04  2:28     ` Chao Peng
2022-11-04 22:29       ` Sean Christopherson
2022-11-08  7:16         ` Chao Peng
2022-11-10 17:53           ` Sean Christopherson
2022-11-10 20:06   ` Sean Christopherson
2022-11-11  8:27     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 5/8] KVM: Register/unregister the guest private memory regions Chao Peng
2022-10-27 10:31   ` Fuad Tabba
2022-11-03 23:04   ` Sean Christopherson
2022-11-04  8:28     ` Chao Peng
2022-11-04 21:19       ` Sean Christopherson
2022-11-08  8:24         ` Chao Peng
2022-11-08  1:35   ` Yuan Yao
2022-11-08  9:41     ` Chao Peng
2022-11-09  5:52       ` Yuan Yao
2022-11-16 22:24   ` Sean Christopherson
2022-11-17 13:20     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 6/8] KVM: Update lpage info when private/shared memory are mixed Chao Peng
2022-10-26 20:46   ` Isaku Yamahata
2022-10-28  6:38     ` Chao Peng
2022-11-08 12:08   ` Yuan Yao
2022-11-09  4:13     ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 7/8] KVM: Handle page fault for private memory Chao Peng
2022-10-26 21:54   ` Isaku Yamahata
2022-10-28  6:55     ` Chao Peng
2022-11-01  0:02       ` Isaku Yamahata
2022-11-01 11:38         ` Chao Peng
2022-11-16 20:50   ` Ackerley Tng
2022-11-16 22:13     ` Sean Christopherson
2022-11-17 13:25       ` Chao Peng
2022-10-25 15:13 ` [PATCH v9 8/8] KVM: Enable and expose KVM_MEM_PRIVATE Chao Peng
2022-10-27 10:31   ` Fuad Tabba
2022-11-03 12:13 ` [PATCH v9 0/8] KVM: mm: fd-based approach for supporting KVM Vishal Annapurve
2022-11-08  0:41   ` Isaku Yamahata
2022-11-09 15:54     ` Kirill A. Shutemov
2022-11-15 14:36       ` Kirill A. Shutemov [this message]
2022-11-14 11:43 ` Alex Bennée
2022-11-16  5:00   ` Chao Peng
2022-11-16  9:40     ` Alex Bennée
2022-11-17 14:16       ` Chao Peng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221115143654.rqpf72hzdtrd3xyw@box.shutemov.name \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=bfields@fieldses.org \
    --cc=bp@alien8.de \
    --cc=chao.p.peng@linux.intel.com \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=ddutile@redhat.com \
    --cc=dhildenb@redhat.com \
    --cc=hpa@zytor.com \
    --cc=hughd@google.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=jlayton@kernel.org \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=jun.nakajima@intel.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mail@maciej.szmigiero.name \
    --cc=mhocko@suse.com \
    --cc=michael.roth@amd.com \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qperret@google.com \
    --cc=rppt@kernel.org \
    --cc=seanjc@google.com \
    --cc=shuah@kernel.org \
    --cc=songmuchun@bytedance.com \
    --cc=steven.price@arm.com \
    --cc=tabba@google.com \
    --cc=tglx@linutronix.de \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=wei.w.wang@intel.com \
    --cc=x86@kernel.org \
    --cc=yu.c.zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.