From: "Mickaël Salaün" <mic@digikod.net>
To: Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"H . Peter Anvin" <hpa@zytor.com>, Ingo Molnar <mingo@redhat.com>,
	Kees Cook <keescook@chromium.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>
Cc: "Mickaël Salaün" <mic@digikod.net>,
	"Alexander Graf" <graf@amazon.com>,
	"Chao Peng" <chao.p.peng@linux.intel.com>,
	"Edgecombe, Rick P" <rick.p.edgecombe@intel.com>,
	"Forrest Yuan Yu" <yuanyu@google.com>,
	"James Gowans" <jgowans@amazon.com>,
	"James Morris" <jamorris@linux.microsoft.com>,
	"John Andersen" <john.s.andersen@intel.com>,
	"Madhavan T . Venkataraman" <madvenka@linux.microsoft.com>,
	"Marian Rotariu" <marian.c.rotariu@gmail.com>,
	"Mihai Donțu" <mdontu@bitdefender.com>,
	"Nicușor Cîțu" <nicu.citu@icloud.com>,
	"Thara Gopinath" <tgopinath@microsoft.com>,
	"Trilok Soni" <quic_tsoni@quicinc.com>,
	"Wei Liu" <wei.liu@kernel.org>, "Will Deacon" <will@kernel.org>,
	"Yu Zhang" <yu.c.zhang@linux.intel.com>,
	"Zahra Tarkhani" <ztarkhani@microsoft.com>,
	"Ștefan Șicleru" <ssicleru@bitdefender.com>,
	dev@lists.cloudhypervisor.org, kvm@vger.kernel.org,
	linux-hardening@vger.kernel.org, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org, qemu-devel@nongnu.org,
	virtualization@lists.linux-foundation.org, x86@kernel.org,
	xen-devel@lists.xenproject.org
Subject: [RFC PATCH v2 12/19] x86: Implement the Memory Table feature to store arbitrary per-page data
Date: Sun, 12 Nov 2023 21:23:19 -0500
Message-ID: <20231113022326.24388-13-mic@digikod.net>
In-Reply-To: <20231113022326.24388-1-mic@digikod.net>

From: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>

This feature can be used by a consumer to associate any arbitrary
pointer with a physical page. The feature implements a page table format
that mirrors the hardware page table. A leaf entry in the table points
to consumer data for that page.
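
As a rough illustration of how a table that mirrors the hardware page
table selects an entry, the sketch below slices a physical address into
one index per level. The geometry (4 KiB pages, 512 entries per level,
4 levels) and the sketch_ names are assumptions made for this example;
the patch itself derives the per-level values from PTRS_PER_* at boot.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry for illustration only (x86-64 style 4-level paging). */
#define SKETCH_PAGE_SHIFT	12u
#define SKETCH_LEVEL_BITS	9u
#define SKETCH_NLEVELS		4u

/*
 * Return the index used at `level` (0 is the lowest level) for the
 * physical address `pa`. Each level consumes SKETCH_LEVEL_BITS bits
 * of the address, starting just above the page offset.
 */
static unsigned int sketch_index(uint64_t pa, unsigned int level)
{
	unsigned int shift = SKETCH_PAGE_SHIFT + level * SKETCH_LEVEL_BITS;
	unsigned int mask = (1u << SKETCH_LEVEL_BITS) - 1;

	assert(level < SKETCH_NLEVELS);
	return (unsigned int)(pa >> shift) & mask;
}
```

The (shift, mask) pair computed here plays the same role as the shift
and mask fields of struct mem_table_level in the patch.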

The page table format has these advantages:

- The format allows for a sparse representation. This is useful since
  the physical address space can be large and is typically sparsely
  populated in a system.

- A consumer of this feature can choose to populate data just for the
  pages it is interested in.

- Information can be stored for large pages, if a consumer wishes.
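
The sparse, populate-on-demand behaviour listed above can be sketched
in userspace as follows. This is a simplified two-level model, not the
kernel code: the sk_ names, the fixed geometry and the use of calloc()
are assumptions for illustration, and the MEM_TABLE tag bit used by the
patch is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_PAGE_SHIFT	12u
#define SK_BITS		9u
#define SK_SLOTS	(1u << SK_BITS)

/* One table of either level; top-level slots point to lower tables. */
struct sk_table {
	void *slots[SK_SLOTS];
};

static unsigned int sk_idx(uint64_t pa, unsigned int level)
{
	return (unsigned int)(pa >> (SK_PAGE_SHIFT + level * SK_BITS)) &
	       (SK_SLOTS - 1);
}

/* Return the slot for pa, allocating the lower table only when needed. */
static void **sk_create(struct sk_table *top, uint64_t pa)
{
	void **slot = &top->slots[sk_idx(pa, 1)];
	struct sk_table *low;

	assert(top);
	if (!*slot) {
		low = calloc(1, sizeof(*low));
		if (!low)
			return NULL;
		*slot = low;
	}
	low = *slot;
	return &low->slots[sk_idx(pa, 0)];
}

/* Return the populated slot for pa, or NULL if nothing was stored. */
static void **sk_find(struct sk_table *top, uint64_t pa)
{
	struct sk_table *low = top->slots[sk_idx(pa, 1)];
	void **slot;

	if (!low)
		return NULL;
	slot = &low->slots[sk_idx(pa, 0)];
	return *slot ? slot : NULL;
}

/* Free the tables themselves; consumer data is not owned by the sketch. */
static void sk_free(struct sk_table *top)
{
	unsigned int i;

	for (i = 0; i < SK_SLOTS; i++)
		free(top->slots[i]);
	free(top);
}
```

Only the lower tables that cover a populated page are ever allocated,
which is what keeps the representation small for a sparse address space.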

For instance, for Heki, the guest kernel uses this to create permissions
counters for each guest physical page. The permissions counters reflect
the collective permissions for a guest physical page across all mappings
to that page. This allows the guest to request the hypervisor to set
only the necessary permissions for a guest physical page in the EPT
(instead of RWX).
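
The permissions-counter idea can be sketched as follows. This is a
hypothetical userspace model: the structure layout, the flag values and
the sketch_ names are assumptions made for this example, and the real
counters are introduced by later patches in this series.

```c
#include <assert.h>

/* Assumed flag encoding, for illustration only. */
#define SKETCH_PERM_R	0x1u
#define SKETCH_PERM_W	0x2u
#define SKETCH_PERM_X	0x4u

/* One counter per permission: how many mappings of the page grant it. */
struct sketch_perm_counters {
	unsigned int r, w, x;
};

/* Account for a new mapping of the page with the given permissions. */
static void sketch_map(struct sketch_perm_counters *c, unsigned int perms)
{
	assert(c);
	if (perms & SKETCH_PERM_R)
		c->r++;
	if (perms & SKETCH_PERM_W)
		c->w++;
	if (perms & SKETCH_PERM_X)
		c->x++;
}

/* Account for a removed mapping; counters never drop below zero. */
static void sketch_unmap(struct sketch_perm_counters *c, unsigned int perms)
{
	assert(c);
	if ((perms & SKETCH_PERM_R) && c->r)
		c->r--;
	if ((perms & SKETCH_PERM_W) && c->w)
		c->w--;
	if ((perms & SKETCH_PERM_X) && c->x)
		c->x--;
}

/* Collective permissions: a bit is set if any mapping still needs it. */
static unsigned int sketch_collective(const struct sketch_perm_counters *c)
{
	assert(c);
	return (c->r ? SKETCH_PERM_R : 0) |
	       (c->w ? SKETCH_PERM_W : 0) |
	       (c->x ? SKETCH_PERM_X : 0);
}
```

In this model, the value returned by sketch_collective() is what a
guest would ask the hypervisor to allow for the page, rather than RWX.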

This feature could also be used to improve KVM's memory attributes and
write page tracking.

A future version will support large page entries in mem_table through
additional merge() and split() operations in mem_table_ops.

Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
Cc: Mickaël Salaün <mic@digikod.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---

Changes since v1:
* New patch and new file: kernel/mem_table.c
---
 arch/x86/kernel/setup.c   |   2 +
 include/linux/heki.h      |   1 +
 include/linux/mem_table.h |  55 ++++++++++
 kernel/Makefile           |   2 +
 kernel/mem_table.c        | 219 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 279 insertions(+)
 create mode 100644 include/linux/mem_table.h
 create mode 100644 kernel/mem_table.c

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b098b1fa2470..e7ae46953ae4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -25,6 +25,7 @@
 #include <linux/static_call.h>
 #include <linux/swiotlb.h>
 #include <linux/random.h>
+#include <linux/mem_table.h>
 
 #include <uapi/linux/mount.h>
 
@@ -1315,6 +1316,7 @@ void __init setup_arch(char **cmdline_p)
 #endif
 
 	unwind_init();
+	mem_table_init(PG_LEVEL_4K);
 }
 
 #ifdef CONFIG_X86_32
diff --git a/include/linux/heki.h b/include/linux/heki.h
index 89cc9273a968..9b0c966c50d1 100644
--- a/include/linux/heki.h
+++ b/include/linux/heki.h
@@ -15,6 +15,7 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/printk.h>
+#include <linux/slab.h>
 
 #ifdef CONFIG_HEKI
 
diff --git a/include/linux/mem_table.h b/include/linux/mem_table.h
new file mode 100644
index 000000000000..738bf12309f3
--- /dev/null
+++ b/include/linux/mem_table.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Memory table feature - Definitions.
+ *
+ * Copyright © 2023 Microsoft Corporation.
+ */
+
+#ifndef __MEM_TABLE_H__
+#define __MEM_TABLE_H__
+
+/* clang-format off */
+
+/*
+ * The MEM_TABLE bit is set on entries that point to an intermediate table.
+ * So, this bit is reserved. This means that pointers to consumer data must
+ * be at least two-byte aligned (so the MEM_TABLE bit is 0).
+ */
+#define MEM_TABLE		BIT(0)
+#define IS_LEAF(entry)		(!((uintptr_t)(entry) & MEM_TABLE))
+
+/* clang-format on */
+
+/*
+ * A memory table is arranged exactly like a page table. The memory table
+ * configuration reflects the hardware page table configuration.
+ */
+
+/* Parameters at each level of the memory table hierarchy. */
+struct mem_table_level {
+	unsigned int number;
+	unsigned int nentries;
+	unsigned int shift;
+	unsigned int mask;
+};
+
+struct mem_table {
+	struct mem_table_level *level;
+	struct mem_table_ops *ops;
+	bool changed;
+	void *entries[];
+};
+
+/* Operations that need to be supplied by a consumer of memory tables. */
+struct mem_table_ops {
+	void (*free)(void *buf);
+};
+
+void mem_table_init(unsigned int base_level);
+struct mem_table *mem_table_alloc(struct mem_table_ops *ops);
+void mem_table_free(struct mem_table *table);
+void **mem_table_create(struct mem_table *table, phys_addr_t pa);
+void **mem_table_find(struct mem_table *table, phys_addr_t pa,
+		      unsigned int *level_num);
+
+#endif /* __MEM_TABLE_H__ */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..dcef03ec5c54 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -131,6 +131,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue.o
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_SPARSEMEM) += mem_table.o
+
 CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN)
 obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
 KASAN_SANITIZE_stackleak.o := n
diff --git a/kernel/mem_table.c b/kernel/mem_table.c
new file mode 100644
index 000000000000..280a1b5ddde0
--- /dev/null
+++ b/kernel/mem_table.c
@@ -0,0 +1,219 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory table feature.
+ *
+ * This feature can be used by a consumer to associate any arbitrary pointer
+ * with a physical page. The feature implements a page table format that
+ * mirrors the hardware page table. A leaf entry in the table points to
+ * consumer data for that page.
+ *
+ * The page table format has these advantages:
+ *
+ *	- The format allows for a sparse representation. This is useful since
+ *	  the physical address space can be large and is typically sparsely
+ *	  populated in a system.
+ *
+ *	- A consumer of this feature can choose to populate data just for
+ *	  the pages it is interested in.
+ *
+ *	- Information can be stored for large pages, if a consumer wishes.
+ *
+ * For instance, for Heki, the guest kernel uses this to create permissions
+ * counters for each guest physical page. The permissions counters reflect the
+ * collective permissions for a guest physical page across all mappings to that
+ * page. This allows the guest to request the hypervisor to set only the
+ * necessary permissions for a guest physical page in the EPT (instead of RWX).
+ *
+ * Copyright © 2023 Microsoft Corporation.
+ */
+
+/*
+ * Memory table functions use recursion for simplicity. The recursion is bounded
+ * by the number of hardware page table levels.
+ *
+ * Locking is left to the caller of these functions.
+ */
+#include <linux/heki.h>
+#include <linux/mem_table.h>
+#include <linux/pgtable.h>
+
+#define TABLE(entry) ((void *)((uintptr_t)(entry) & ~MEM_TABLE))
+#define ENTRY(table) ((void *)((uintptr_t)(table) | MEM_TABLE))
+
+/*
+ * Within this feature, the table levels start from 0. On X86, the base level
+ * is not 0.
+ */
+unsigned int mem_table_base_level __ro_after_init;
+unsigned int mem_table_nlevels __ro_after_init;
+struct mem_table_level mem_table_levels[CONFIG_PGTABLE_LEVELS] __ro_after_init;
+
+void __init mem_table_init(unsigned int base_level)
+{
+	struct mem_table_level *level;
+	unsigned long shift, delta_shift;
+	int physmem_bits;
+	int i, max_levels;
+
+	/*
+	 * Compute the actual number of levels present. Compute the parameters
+	 * for each level.
+	 */
+	shift = ilog2(PAGE_SIZE);
+	physmem_bits = PAGE_SHIFT;
+	max_levels = CONFIG_PGTABLE_LEVELS;
+
+	for (i = 0; i < max_levels && physmem_bits < MAX_PHYSMEM_BITS; i++) {
+		level = &mem_table_levels[i];
+
+		switch (i) {
+		case 0:
+			level->nentries = PTRS_PER_PTE;
+			break;
+		case 1:
+			level->nentries = PTRS_PER_PMD;
+			break;
+		case 2:
+			level->nentries = PTRS_PER_PUD;
+			break;
+		case 3:
+			level->nentries = PTRS_PER_P4D;
+			break;
+		case 4:
+			level->nentries = PTRS_PER_PGD;
+			break;
+		}
+		level->number = i;
+		level->shift = shift;
+		level->mask = level->nentries - 1;
+
+		delta_shift = ilog2(level->nentries);
+		shift += delta_shift;
+		physmem_bits += delta_shift;
+	}
+	mem_table_nlevels = i;
+	mem_table_base_level = base_level;
+}
+
+struct mem_table *mem_table_alloc(struct mem_table_ops *ops)
+{
+	struct mem_table_level *level;
+	struct mem_table *table;
+
+	level = &mem_table_levels[mem_table_nlevels - 1];
+
+	table = kzalloc(struct_size(table, entries, level->nentries),
+			GFP_KERNEL);
+	if (table) {
+		table->level = level;
+		table->ops = ops;
+		return table;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(mem_table_alloc);
+
+static void _mem_table_free(struct mem_table *table)
+{
+	struct mem_table_level *level = table->level;
+	void **entries = table->entries;
+	struct mem_table_ops *ops = table->ops;
+	int i;
+
+	for (i = 0; i < level->nentries; i++) {
+		if (!entries[i])
+			continue;
+		if (IS_LEAF(entries[i])) {
+			/* The consumer frees the pointer. */
+			ops->free(entries[i]);
+			continue;
+		}
+		_mem_table_free(TABLE(entries[i]));
+	}
+	kfree(table);
+}
+
+void mem_table_free(struct mem_table *table)
+{
+	_mem_table_free(table);
+}
+EXPORT_SYMBOL_GPL(mem_table_free);
+
+static void **_mem_table_find(struct mem_table *table, phys_addr_t pa,
+			      unsigned int *level_number)
+{
+	struct mem_table_level *level = table->level;
+	void **entries = table->entries;
+	unsigned long i;
+
+	i = (pa >> level->shift) & level->mask;
+
+	*level_number = level->number;
+	if (!entries[i])
+		return NULL;
+
+	if (IS_LEAF(entries[i]))
+		return &entries[i];
+
+	return _mem_table_find(TABLE(entries[i]), pa, level_number);
+}
+
+void **mem_table_find(struct mem_table *table, phys_addr_t pa,
+		      unsigned int *level_number)
+{
+	void **entry;
+
+	entry = _mem_table_find(table, pa, level_number);
+	*level_number += mem_table_base_level;
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(mem_table_find);
+
+static void **_mem_table_create(struct mem_table *table, phys_addr_t pa)
+{
+	struct mem_table_level *level = table->level;
+	void **entries = table->entries;
+	unsigned long i;
+
+	table->changed = true;
+	i = (pa >> level->shift) & level->mask;
+
+	if (!level->number) {
+		/*
+		 * Reached the lowest level. Return a pointer to the entry
+		 * so that the consumer can populate it.
+		 */
+		return &entries[i];
+	}
+
+	/*
+	 * If the entry is NULL, then create a lower level table and make the
+	 * entry point to it. Or, if the entry is a leaf, then we need to
+	 * split the entry. In this case as well, create a lower level table
+	 * to split the entry.
+	 */
+	if (!entries[i] || IS_LEAF(entries[i])) {
+		struct mem_table *next;
+
+		/* Create next level table. */
+		level--;
+		next = kzalloc(struct_size(next, entries, level->nentries),
+			       GFP_KERNEL);
+		if (!next)
+			return NULL;
+
+		next->level = level;
+		next->ops = table->ops;
+		next->changed = true;
+		entries[i] = ENTRY(next);
+	}
+
+	return _mem_table_create(TABLE(entries[i]), pa);
+}
+
+void **mem_table_create(struct mem_table *table, phys_addr_t pa)
+{
+	return _mem_table_create(table, pa);
+}
+EXPORT_SYMBOL_GPL(mem_table_create);
-- 
2.42.1