linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 0/5] Add TDX Guest Attestation support
@ 2022-06-09  2:52 Kuppuswamy Sathyanarayanan
  2022-06-09  2:52 ` [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
                   ` (5 more replies)
  0 siblings, 6 replies; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. A VM guest with TDX support is called
a TD guest.

In a TD guest, the attestation process is used to prove the
trustworthiness of the TD guest to third-party servers. Such servers
require attestation before sending sensitive information to TD guests.
One usage example is getting encryption keys from a key server to
mount an encrypted rootfs or secondary drive.
    
The following patches add attestation support to the TDX guest, which
includes an attestation user interface driver and related hypercall
support.

Any distribution enabling TDX is also expected to need attestation. So
enable it by default with TDX guest support. The compiled size is
quite small (~500 bytes).

Changes since v7:
 * Removed exported symbols for tdx_setup_ev_notify_handler() and
   tdx_remove_ev_notify_handler().
 * Changed "struct quote_buf *buf" member in "struct quote_entry"
   from a pointer to an embedded object.
 * Rebased on top of v5.19-rc1.

Changes since v6:
 * Fixed race between wait_for_completion_*() and
   quote_callback_handler() in tdx_get_quote() when user terminates the
   request.
 * Fixed commit log and comments.

Changes since v5:
 * Added support for parallel GetQuote requests.
 * Add noalias variants of set_memory_*crypted() functions to
   change page attributes without touching the direct map.
 * Made set_memory_*crypted() functions vmalloc address compatible.
 * Use vmap()/set_memory_*crypted() functions to share/unshare
   memory without touching the direct map.
 * Add support to let driver handle the memory cleanup for the
   early termination of user requests.
 * Removed unused headers in attest.c
 * Fixed commit log and comments as per review comments.

Changes since v4:
 * Removed platform driver model in attestation driver and used
   miscdevice and initcall approach.
 * Since dma_alloc*() APIs require a valid device reference,
   replaced it with __get_free_pages() and set_memory_decrypted()
   for quote memory allocation.
 * Removed tdx_mcall_tdreport() and moved TDG.MR.REPORT TDCALL code
   to tdx_get_report().
 * Used kmalloc() for TDREPORT memory allocation instead of
   get_zeroed_page().
 * Returned -EINVAL in default case of tdx_attest_ioctl().
 * Added struct tdx_report_req to explicitly mention the
   TDX_CMD_GET_REPORT IOCTL argument.
 * Removed tdx_get_quote_hypercall() and moved hypercall code to
   attestation driver itself.
 * Removed GetQuote timeout support (since it is not defined in
   spec)
 * Added support to check for spurious callback interrupt in GetQuote
   request.
 * Fixed commit log and comments as per review suggestions.
   

Changes since v3:
 * Moved the attestation driver from platform/x86 to arch/x86/coco/tdx/ and
   renamed intel_tdx_attest.c to attest.c.
 * Dropped CONFIG_INTEL_TDX_ATTESTATION and added support to compile
   attestation changes with CONFIG_INTEL_TDX_GUEST option.
 * Merged patch titled "x86/tdx: Add tdx_mcall_tdreport() API support" and
   "platform/x86: intel_tdx_attest: Add TDX Guest attestation interface" into
   a single patch.
 * Moved GetQuote IOCTL support changes from patch titled "platform/x86:
   intel_tdx_attest: Add TDX Guest attestation interface driver" to a
   separate patch.
 * Removed 8K size restriction when requesting quote, and added support
   to let userspace decide the quote size.
 * Added support to allow the attestation agent to configure the quote
   generation timeout value.
 * Fixed commit log and comments as per review comments.

Changes since v2:
 * As per Han's suggestion, modified the attestation driver to use
   platform device driver model.
 * Modified tdx_hcall_get_quote() and tdx_mcall_tdreport() APIs to
   return TDCALL error code instead of generic error info (like -EIO).
 * Removed attestation test app patch from this series to simplify
   the patchset and review process. Test app patches will be submitted
   once attestation support patches are merged.
 * Since patches titled "x86/tdx: Add SetupEventNotifyInterrupt TDX
   hypercall support" and "x86/tdx: Add TDX Guest event notify
   interrupt vector support" are related, combining them into a
   single patch.

Changes since v1:
 * Moved test driver from "tools/tdx/attest/tdx-attest-test.c" to
   "tools/arch/x86/tdx/attest/tdx-attest-test.c" as per Hans review
   suggestion.
 * Minor commit log and comment fixes in patches titled
   "x86/tdx: Add tdx_mcall_tdreport() API support" and "x86/tdx:
   Add tdx_hcall_get_quote() API support"
 * Extended tdx_hcall_get_quote() API to accept GPA length as argument
   to accommodate the latest TDQUOTE TDVMCALL related specification update.
 * Added support for tdx_setup_ev_notify_handler() and
   tdx_remove_ev_notify_handler() in patch titled "x86/tdx: Add TDX
   Guest event notify interrupt vector support"


Kuppuswamy Sathyanarayanan (5):
  x86/tdx: Add TDX Guest attestation interface driver
  x86/tdx: Add TDX Guest event notify interrupt support
  x86/mm: Make tdx_enc_status_changed() vmalloc address compatible
  x86/mm: Add noalias variants of set_memory_*crypted() functions
  x86/tdx: Add Quote generation support

 arch/x86/coco/tdx/Makefile         |   2 +-
 arch/x86/coco/tdx/attest.c         | 422 +++++++++++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c            |  82 +++++-
 arch/x86/include/asm/hardirq.h     |   3 +
 arch/x86/include/asm/idtentry.h    |   4 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/set_memory.h  |   2 +
 arch/x86/include/asm/tdx.h         |   4 +
 arch/x86/include/uapi/asm/tdx.h    |  87 ++++++
 arch/x86/kernel/irq.c              |   7 +
 arch/x86/mm/pat/set_memory.c       |  26 +-
 11 files changed, 636 insertions(+), 10 deletions(-)
 create mode 100644 arch/x86/coco/tdx/attest.c
 create mode 100644 arch/x86/include/uapi/asm/tdx.h

-- 
2.25.1



* [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
@ 2022-06-09  2:52 ` Kuppuswamy Sathyanarayanan
  2022-06-24 16:51   ` Dave Hansen
  2022-06-09  2:52 ` [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

In TDX guest, attestation is used to verify the trustworthiness of a TD
to other entities before provisioning secrets to the TD.

One usage example: when a TD guest uses an encrypted drive and the
decryption keys required to access the drive are stored in a secure
third-party key server, the key server can use attestation to verify
the TD's trustworthiness and release the decryption keys to the TD.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
the TDX module which contains TD-specific information (such as TD
measurements), platform security version, and the MAC to protect the
integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
get the TDREPORT from the TDX module. A user-provided 64-Byte
REPORTDATA is used as input and included in the TDREPORT. Typically it
can be some nonce provided by attestation service so the TDREPORT can
be verified uniquely. More details about TDREPORT can be found in
Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".

TDREPORT can only be verified on the local platform as the MAC key is
bound to the platform. To support remote verification of the TDREPORT,
TDX leverages the Intel SGX Quoting Enclave (QE) to verify the TDREPORT
locally and convert it to a remotely verifiable Quote.

After getting the TDREPORT, the second step of the attestation process
is to send it to the QE to generate the Quote. TDX doesn't support SGX
inside the TD, so the QE can be deployed in the host, or in another
legacy VM with SGX support. How to send the TDREPORT to QE and receive
the Quote is implementation and deployment specific.

Implement a basic attestation driver to allow TD userspace to get the
TDREPORT. The TD userspace attestation software can get the TDREPORT
and then choose whatever communication channel is available (e.g.
vsock) to send the TDREPORT to the QE and receive the Quote.
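
As a rough sketch of what the userspace side of this interface might
look like, the helper below fills a request with REPORTDATA, issues
TDX_CMD_GET_REPORT, and copies the TDREPORT out. It assumes the device
node shows up as /dev/tdx-attest (derived from the miscdevice name in
this patch; the actual path depends on udev/devtmpfs), and it
redefines the UAPI structure and ioctl number locally because the
header added by this patch may not be installed. The function name
tdx_fetch_report() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Local copies of the definitions proposed in
 * arch/x86/include/uapi/asm/tdx.h by this patch. Prefer the installed
 * header when it is available.
 */
#define TDX_REPORTDATA_LEN	64
#define TDX_REPORT_LEN		1024

struct tdx_report_req {
	union {
		uint8_t reportdata[TDX_REPORTDATA_LEN];
		uint8_t tdreport[TDX_REPORT_LEN];
	};
};

#define TDX_CMD_GET_REPORT	_IOWR('T', 0x01, struct tdx_report_req)

/* The same buffer carries input and output, so it must hold a TDREPORT. */
_Static_assert(sizeof(struct tdx_report_req) == TDX_REPORT_LEN,
	       "request must be TDX_REPORT_LEN bytes");

/*
 * Hypothetical helper. fd is an open descriptor for the attestation
 * device (assumed to be /dev/tdx-attest). Returns 0 on success, or -1
 * with errno set by ioctl().
 */
static int tdx_fetch_report(int fd,
			    const uint8_t reportdata[TDX_REPORTDATA_LEN],
			    uint8_t tdreport[TDX_REPORT_LEN])
{
	struct tdx_report_req req;

	assert(reportdata != NULL && tdreport != NULL);

	/* REPORTDATA occupies the first 64 bytes of the request. */
	memset(&req, 0, sizeof(req));
	memcpy(req.reportdata, reportdata, TDX_REPORTDATA_LEN);

	if (ioctl(fd, TDX_CMD_GET_REPORT, &req) < 0)
		return -1;

	/* On success the driver has overwritten the request with the TDREPORT. */
	memcpy(tdreport, req.tdreport, TDX_REPORT_LEN);
	return 0;
}
```

A caller would open the device node, pass a 64-byte nonce obtained
from the attestation service as reportdata, and forward the returned
1024-byte TDREPORT to the QE over its chosen channel.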

Also note that explicit access permissions are not enforced in this
driver because the quote and measurements are not a secret. However
the access permissions of the device node can be used to set any
desired access policy. The udev default is usually root access
only.

Operations like getting the TDREPORT or generating a Quote involve
sending a blob of data as input and getting another blob of data as
output. A sysfs interface was considered for this, but it doesn't fit
well into the standard sysfs model for configuring values. It would be
possible to do read/write on files, but that would need multiple file
descriptors, which would be somewhat messy. IOCTLs seem to be the
simplest and best-fitting model for this use case. Also, the REPORTDATA
used in TDREPORT generation may come from the attestation service to
uniquely verify the Quote (like per-instance verification). In such a
case, since REPORTDATA is a secret, using sysfs to share it is insecure
compared to sending it via IOCTL.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/Makefile      |   2 +-
 arch/x86/coco/tdx/attest.c      | 118 ++++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/tdx.h |  42 ++++++++++++
 3 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/coco/tdx/attest.c
 create mode 100644 arch/x86/include/uapi/asm/tdx.h

diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..d2db3e6770e5 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdcall.o attest.o
diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
new file mode 100644
index 000000000000..24db0bad4923
--- /dev/null
+++ b/arch/x86/coco/tdx/attest.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * attest.c - TDX guest attestation interface driver.
+ *
+ * Implements user interface to trigger attestation process.
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *
+ */
+
+#define pr_fmt(fmt) "x86/tdx: attest: " fmt
+
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <asm/tdx.h>
+#include <uapi/asm/tdx.h>
+
+#define DRIVER_NAME "tdx-attest"
+
+/* TDREPORT module call leaf ID */
+#define TDX_GET_REPORT			4
+
+static struct miscdevice miscdev;
+
+static long tdx_get_report(void __user *argp)
+{
+	void *reportdata = NULL, *tdreport = NULL;
+	long ret;
+
+	/* Allocate buffer space for REPORTDATA */
+	reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
+	if (!reportdata)
+		return -ENOMEM;
+
+	/* Allocate buffer space for TDREPORT */
+	tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
+	if (!tdreport) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Copy REPORTDATA from the user buffer */
+	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	/*
+	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
+	 *
+	 * Get the TDREPORT using REPORTDATA as input. Refer to
+	 * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
+	 * Specification for detailed information.
+	 */
+	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
+				virt_to_phys(reportdata), 0, 0, NULL);
+	if (ret) {
+		pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
+		ret = -EIO;
+		goto out;
+	}
+
+	/* Copy TDREPORT back to the user buffer */
+	if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
+		ret = -EFAULT;
+
+out:
+	kfree(reportdata);
+	kfree(tdreport);
+	return ret;
+}
+
+static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
+			     unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	long ret = -EINVAL;
+
+	switch (cmd) {
+	case TDX_CMD_GET_REPORT:
+		ret = tdx_get_report(argp);
+		break;
+	default:
+		pr_debug("cmd %d not supported\n", cmd);
+		break;
+	}
+
+	return ret;
+}
+
+static const struct file_operations tdx_attest_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= tdx_attest_ioctl,
+	.llseek		= no_llseek,
+};
+
+static int __init tdx_attestation_init(void)
+{
+	int ret;
+
+	/* Make sure we are in a valid TDX platform */
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return -EIO;
+
+	miscdev.name = DRIVER_NAME;
+	miscdev.minor = MISC_DYNAMIC_MINOR;
+	miscdev.fops = &tdx_attest_fops;
+
+	ret = misc_register(&miscdev);
+	if (ret) {
+		pr_err("misc device registration failed\n");
+		return ret;
+	}
+
+	return 0;
+}
+device_initcall(tdx_attestation_init);
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
new file mode 100644
index 000000000000..8b57dea67eab
--- /dev/null
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_TDX_H
+#define _UAPI_ASM_X86_TDX_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORTDATA_LEN		64
+
+/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORT_LEN			1024
+
+/**
+ * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
+ *
+ * @reportdata : User-defined 64-Byte REPORTDATA to be included into
+ *		 TDREPORT. Typically it can be some nonce provided by
+ *		 attestation service, so the generated TDREPORT can be
+ *		 uniquely verified.
+ * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
+ *		 TDX_REPORT_LEN.
+ *
+ * Used in TDX_CMD_GET_REPORT IOCTL request.
+ */
+struct tdx_report_req {
+	union {
+		__u8 reportdata[TDX_REPORTDATA_LEN];
+		__u8 tdreport[TDX_REPORT_LEN];
+	};
+};
+
+/*
+ * TDX_CMD_GET_REPORT - Get TDREPORT using TDCALL[TDG.MR.REPORT]
+ *
+ * Return 0 on success, -EIO on TDCALL execution failure, and
+ * standard errno on other general error cases.
+ *
+ */
+#define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
+
+#endif /* _UAPI_ASM_X86_TDX_H */
-- 
2.25.1



* [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
  2022-06-09  2:52 ` [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
@ 2022-06-09  2:52 ` Kuppuswamy Sathyanarayanan
  2022-06-20 12:33   ` Kai Huang
  2022-06-09  2:52 ` [PATCH v8 3/5] x86/mm: Make tdx_enc_status_changed() vmalloc address compatible Kuppuswamy Sathyanarayanan
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

Host-guest event notification via a configured interrupt vector is
useful in cases where a guest makes an asynchronous request and needs a
callback from the host to indicate completion, or where the host needs
to notify the guest about events like device removal. One usage example
is the callback requirement of the asynchronous GetQuote hypercall.

In a TDX guest, the SetupEventNotifyInterrupt hypercall can be used by
the guest to tell the VMM which interrupt vector to use as the
event-notify vector. Details about the SetupEventNotifyInterrupt
hypercall can be found in the TDX Guest-Host Communication Interface
(GHCI) Specification, sec 3.5 "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
Add a tdx_hcall_set_notify_intr() helper function to implement the
SetupEventNotifyInterrupt hypercall.

Reserve IRQ vector 0xec for the TDX guest to receive the event
completion notification from the VMM. Also add the related IDT handler
to process the notification event.

Add support to track the notification event status via
/proc/interrupts.
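
The registration and dispatch flow can be modelled in userspace as a
single handler slot that the interrupt path reads and the driver
writes. This is only an illustrative sketch, not kernel code: all
model_* names are hypothetical, and the use of C11 atomics is an
assumption made for the model (the patch itself uses a plain function
pointer and does not consider concurrent updates).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

typedef void (*notify_fn)(void);

/* Single handler slot shared by the dispatch path and the driver. */
static _Atomic(notify_fn) model_handler;

/* Counter standing in for the per-CPU statistic in /proc/interrupts. */
static int model_event_count;

/* Example handler that only counts the events it receives. */
static void model_count_event(void)
{
	model_event_count++;
}

static void model_setup_handler(notify_fn fn)
{
	atomic_store(&model_handler, fn);
}

static void model_remove_handler(void)
{
	atomic_store(&model_handler, (notify_fn)0);
}

/* Models the vector handler body: load the slot once, then call it. */
static void model_dispatch(void)
{
	notify_fn fn = atomic_load(&model_handler);

	if (fn)
		fn();
}

/*
 * Mirrors the range check in tdx_hcall_set_notify_intr(): vectors
 * below 32 are rejected. The upper bound models the u8 parameter type.
 */
static int model_check_vector(unsigned int vector)
{
	if (vector < 32 || vector > 255)
		return -EINVAL;
	return 0;
}
```

Loading the slot into a local before the call means a concurrent
unregister cannot turn the call into a NULL dereference in this model.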

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c            | 71 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/hardirq.h     |  3 ++
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/include/asm/irq_vectors.h |  7 ++-
 arch/x86/include/asm/tdx.h         |  4 ++
 arch/x86/kernel/irq.c              |  7 +++
 6 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..94542876f26a 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -11,6 +11,10 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/apic.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+#include <asm/desc.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -19,6 +23,7 @@
 
 /* TDX hypercall Leaf IDs */
 #define TDVMCALL_MAP_GPA		0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -34,6 +39,26 @@
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+	tdx_event_notify_handler = handler;
+}
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+	tdx_event_notify_handler = NULL;
+}
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -98,6 +123,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
 }
 
+/* TDX guest event notification handler */
+DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	inc_irq_stat(irq_tdx_event_notify_count);
+
+	if (tdx_event_notify_handler)
+		tdx_event_notify_handler();
+
+	ack_APIC_irq();
+
+	set_irq_regs(old_regs);
+}
+
+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * return 0 on success or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+	/* Minimum vector value allowed is 32 */
+	if (vector < 32)
+		return -EINVAL;
+
+	/*
+	 * Register callback vector address with VMM. More details
+	 * about the ABI can be found in TDX Guest-Host-Communication
+	 * Interface (GHCI), sec titled
+	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+	 */
+	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+		return -EIO;
+
+	return 0;
+}
+
 static u64 get_cc_mask(void)
 {
 	struct tdx_module_output out;
@@ -688,5 +753,11 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
 	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
 
+	alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
+			asm_sysvec_tdx_event_notify);
+
+	if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
+		pr_warn("Setting event notification interrupt failed\n");
+
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..582deff56210 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+	unsigned int irq_tdx_event_notify_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..655086dd940e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
 DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_asyncpf_interrupt);
 #endif
 
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR,	sysvec_tdx_event_notify);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..82ac0c0a34b1 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,12 @@
 #define HYPERV_STIMER0_VECTOR		0xed
 #endif
 
-#define LOCAL_TIMER_VECTOR		0xec
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+/* Vector on which TDX Guest event notification is delivered */
+#define TDX_GUEST_EVENT_NOTIFY_VECTOR	0xec
+#endif
+
+#define LOCAL_TIMER_VECTOR		0xeb
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..a96ecd866723 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+	seq_printf(p, "%*s: ", prec, "TGN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->irq_tdx_event_notify_count);
+	seq_puts(p, "  TDX Guest event notification\n");
 #endif
 	return 0;
 }
-- 
2.25.1



* [PATCH v8 3/5] x86/mm: Make tdx_enc_status_changed() vmalloc address compatible
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
  2022-06-09  2:52 ` [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
  2022-06-09  2:52 ` [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
@ 2022-06-09  2:52 ` Kuppuswamy Sathyanarayanan
  2022-06-09  2:52 ` [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions Kuppuswamy Sathyanarayanan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

The set_memory_*crypted() APIs change the encryption or decryption
page attributes for the given address. By default, they also support
conversion of vmalloc'ed memory addresses.

In a TDX guest, tdx_enc_status_changed() is triggered by the
set_memory_*crypted() APIs when converting memory between shared and
private. Internally this function uses __pa() for physical address
conversion, which breaks the vmalloc address compatibility of the
set_memory_*crypted() APIs.

So handle vmalloc'ed addresses in tdx_enc_status_changed() to fix the
compatibility issue.
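
The physical address arithmetic involved can be sketched in plain C.
This is a userspace illustration only, assuming 4 KiB pages (a page
shift of 12); the model_* names are hypothetical and do not exist in
the kernel. It derives a byte address from a page frame number and
the end of a range from a page count.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12
#define MODEL_PAGE_SIZE		(UINT64_C(1) << MODEL_PAGE_SHIFT)

/* Convert a page frame number to the physical address of that page. */
static uint64_t model_pfn_to_phys(uint64_t pfn)
{
	return pfn << MODEL_PAGE_SHIFT;
}

/*
 * Compute the exclusive end of a range of numpages pages starting at
 * start. This assumes the pages are physically contiguous, which is
 * an assumption of the sketch rather than a general property of
 * vmalloc'ed memory.
 */
static uint64_t model_range_end(uint64_t start, unsigned int numpages)
{
	uint64_t len = (uint64_t)numpages * MODEL_PAGE_SIZE;

	/* Guard against wrap-around before adding. */
	assert(start <= UINT64_MAX - len);
	return start + len;
}
```

For example, page frame 0x12345 maps to physical address 0x12345000,
and a two-page range starting there ends at 0x12347000.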

Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 94542876f26a..445f58547776 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -15,6 +15,7 @@
 #include <asm/idtentry.h>
 #include <asm/irq_regs.h>
 #include <asm/desc.h>
+#include <asm/io.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -678,8 +679,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
  */
 static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
 {
-	phys_addr_t start = __pa(vaddr);
-	phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+	phys_addr_t start, end;
+
+	if (is_vmalloc_addr((void *)vaddr))
+		start = vmalloc_to_pfn((void *) vaddr) << PAGE_SHIFT;
+	else
+		start = __pa(vaddr);
+
+	end = start + numpages * PAGE_SIZE;
 
 	if (!enc) {
 		/* Set the shared (decrypted) bits: */
-- 
2.25.1



* [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
                   ` (2 preceding siblings ...)
  2022-06-09  2:52 ` [PATCH v8 3/5] x86/mm: Make tdx_enc_status_changed() vmalloc address compatible Kuppuswamy Sathyanarayanan
@ 2022-06-09  2:52 ` Kuppuswamy Sathyanarayanan
  2022-06-24 13:19   ` Dave Hansen
  2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2022-06-24 18:24 ` [PATCH v8 0/5] Add TDX Guest Attestation support Dave Hansen
  5 siblings, 1 reply; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

The set_memory_*crypted() functions modify the "shared" page
attribute of the given memory. Using these APIs also modifies the page
attributes of all aliased mappings (including the direct mapping).

Modifying the aliased mappings is not desirable in use cases like the
TDX guest, where the requirement is to create a shared mapping without
touching the direct map. This is the case when allocating VMM shared
buffers using a combination of alloc_pages(), vmap() and
set_memory_*crypted().

To support such use cases, add noalias variants of the
set_memory_*crypted() functions.

Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/include/asm/set_memory.h |  2 ++
 arch/x86/mm/pat/set_memory.c      | 26 ++++++++++++++++++++------
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index b45c4d27fd46..a79d59692494 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -46,7 +46,9 @@ int set_memory_wb(unsigned long addr, int numpages);
 int set_memory_np(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
 int set_memory_encrypted(unsigned long addr, int numpages);
+int set_memory_encrypted_noalias(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
+int set_memory_decrypted_noalias(unsigned long addr, int numpages);
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..ad6c8ece737d 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2021,7 +2021,8 @@ int set_memory_global(unsigned long addr, int numpages)
  * __set_memory_enc_pgtable() is used for the hypervisors that get
  * informed about "encryption" status via page tables.
  */
-static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_pgtable(unsigned long addr, int numpages,
+		bool enc, int checkalias)
 {
 	pgprot_t empty = __pgprot(0);
 	struct cpa_data cpa;
@@ -2049,7 +2050,7 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	/* Notify hypervisor that we are about to set/clr encryption attribute. */
 	x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
 
-	ret = __change_page_attr_set_clr(&cpa, 1);
+	ret = __change_page_attr_set_clr(&cpa, checkalias);
 
 	/*
 	 * After changing the encryption attribute, we need to flush TLBs again
@@ -2069,29 +2070,42 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	return ret;
 }
 
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc,
+		int checkalias)
 {
 	if (hv_is_isolation_supported())
 		return hv_set_mem_host_visibility(addr, numpages, !enc);
 
 	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
-		return __set_memory_enc_pgtable(addr, numpages, enc);
+		return __set_memory_enc_pgtable(addr, numpages, enc, checkalias);
 
 	return 0;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, true);
+	return __set_memory_enc_dec(addr, numpages, true, 1);
 }
 EXPORT_SYMBOL_GPL(set_memory_encrypted);
 
 int set_memory_decrypted(unsigned long addr, int numpages)
 {
-	return __set_memory_enc_dec(addr, numpages, false);
+	return __set_memory_enc_dec(addr, numpages, false, 1);
 }
 EXPORT_SYMBOL_GPL(set_memory_decrypted);
 
+int set_memory_encrypted_noalias(unsigned long addr, int numpages)
+{
+	return __set_memory_enc_dec(addr, numpages, true, 0);
+}
+EXPORT_SYMBOL_GPL(set_memory_encrypted_noalias);
+
+int set_memory_decrypted_noalias(unsigned long addr, int numpages)
+{
+	return __set_memory_enc_dec(addr, numpages, false, 0);
+}
+EXPORT_SYMBOL_GPL(set_memory_decrypted_noalias);
+
 int set_pages_uc(struct page *page, int numpages)
 {
 	unsigned long addr = (unsigned long)page_address(page);
-- 
2.25.1



* [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
                   ` (3 preceding siblings ...)
  2022-06-09  2:52 ` [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions Kuppuswamy Sathyanarayanan
@ 2022-06-09  2:52 ` Kuppuswamy Sathyanarayanan
  2022-06-14 12:30   ` Wander Lairson Costa
                     ` (2 more replies)
  2022-06-24 18:24 ` [PATCH v8 0/5] Add TDX Guest Attestation support Dave Hansen
  5 siblings, 3 replies; 77+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-06-09  2:52 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

In a TDX guest, the second stage of the attestation process is to send
the TDREPORT to the QE/QGS to generate the TD Quote. For platforms that
do not support communication channels like vsock or TCP/IP, implement
support to get the TD Quote using a hypercall. The GetQuote hypercall
can be used by the TD guest to request that the VMM facilitate Quote
generation via the QE/QGS. More details about the GetQuote hypercall can
be found in TDX Guest-Host Communication Interface (GHCI) for Intel TDX
1.0, section titled "TDG.VP.VMCALL<GetQuote>".

Since GetQuote is an asynchronous hypercall, it does not block until
the TD Quote is generated. Instead, the VMM uses the callback interrupt
vector configured by the SetupEventNotifyInterrupt hypercall to notify
the guest about Quote generation completion or failure.

The GetQuote TDVMCALL requires the TD guest to pass a 4K aligned shared
buffer with TDREPORT data as input, which the VMM also uses to copy
the TD Quote result after successful Quote generation. To create the
shared buffer without breaking the direct map, allocate physically
contiguous kernel memory and create a virtual mapping for it using
vmap(). The set_memory_*crypted_noalias() functions can be used to share
or unshare the vmapped pages without affecting the direct map.
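
As a rough illustration of the sizing involved (not part of this patch),
the page-aligned length and page count for a request could be computed
as below. The sketch assumes 4K pages, and the sketch_* names are
hypothetical; the kernel code uses PAGE_ALIGN() and PAGE_SHIFT instead.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4K pages, matching the 4K alignment GetQuote asks for. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round len up to the next multiple of the page size. */
static uint64_t sketch_page_align(uint64_t len)
{
	/* Guard against wrap-around in the addition below. */
	assert(len <= UINT64_MAX - (SKETCH_PAGE_SIZE - 1));
	return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Number of pages needed to hold len bytes. */
static uint64_t sketch_page_count(uint64_t len)
{
	return sketch_page_align(len) >> SKETCH_PAGE_SHIFT;
}
```

A request of 4097 bytes, for example, rounds up to two pages.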

Also note that shared buffer allocation is currently handled in the
IOCTL handler. Although this increases the TDX_CMD_GET_QUOTE IOCTL
response time, the increase is negligible compared to the time required
for Quote generation to complete. So IOCTL performance optimization is
not considered at this time.

For shared buffer allocation, alternatives like using the DMA API were
also considered. Although it is simpler to use, it is not preferred
because dma_alloc_*() APIs require a valid bus device as an argument,
which would require converting the attestation driver into a platform
device driver. This is unnecessary, and since the attestation driver
does not do real DMA, there is no need to use real DMA APIs.

Add support for the TDX_CMD_GET_QUOTE IOCTL to allow the attestation
agent to submit GetQuote requests from user space. Since Quote
generation is an asynchronous request, the IOCTL blocks indefinitely
for the VMM response in a wait_for_completion_interruptible() call.
Using this call also lets the user end the current request prematurely
by raising a signal. If the attestation agent knows how long it can
validly wait for the QE/QGS response, it can implement a Quote
generation timeout in the user application using signals. A timeout is
currently not implemented in the driver because the current TDX
specification does not have any recommendation for it.
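
A minimal user-space sketch of such a signal-based timeout follows. It
is an illustration only, not part of this patch: the names
sketch_call_with_timeout() and fake_get_quote() are hypothetical, and
pause() stands in for the blocking ioctl(fd, TDX_CMD_GET_QUOTE, &req)
call so that the example runs without the attestation device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocking call. */
static void on_alarm(int sig)
{
	(void)sig;
}

/*
 * Run blocking_call(arg) with a SIGALRM-based timeout. Returns whatever
 * the call returns; on timeout the call is expected to fail with
 * errno == EINTR.
 */
static int sketch_call_with_timeout(int (*blocking_call)(void *), void *arg,
				    unsigned int seconds)
{
	struct sigaction sa, old;
	int ret, saved_errno;

	assert(blocking_call != NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	/* No SA_RESTART, so the interrupted call is not restarted. */
	if (sigaction(SIGALRM, &sa, &old) < 0)
		return -1;

	alarm(seconds);
	ret = blocking_call(arg);
	saved_errno = errno;

	/* Cancel a pending alarm and restore the previous handler. */
	alarm(0);
	sigaction(SIGALRM, &old, NULL);
	errno = saved_errno;

	return ret;
}

/* Stand-in for the GetQuote ioctl: blocks until a signal arrives. */
static int fake_get_quote(void *arg)
{
	(void)arg;
	return pause();
}
```

With the real device, the callback would wrap the ioctl() call, and an
EINTR result would tell the agent that the request was abandoned.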

After the GetQuote request is submitted using the hypercall, the shared
buffer allocated for the request is owned by the VMM. So, if the user
terminates the request during this wait window by raising a signal or
by terminating the application, add logic to do the memory cleanup
later, after the VMM response is received. Such memory cleanup requires
accepting the page again using the TDX_ACCEPT_PAGE TDX Module call. So,
to not overload the callback IRQ handler, move the callback handler
logic to a separate work queue.

To support parallel GetQuote requests, use a linked list to track the
active GetQuote requests. Upon receiving the callback IRQ, loop
through the active requests and mark the processed requests complete.
Users can open multiple instances of the attestation device and send
GetQuote requests in parallel.
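
The list handling described above can be modelled in user space roughly
as follows. This is a simplified sketch under stated assumptions, not
the driver code: it uses a plain singly linked list instead of the
kernel list API, a bool instead of struct completion, and no locking;
all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_IN_FLIGHT UINT64_MAX	/* mirrors GET_QUOTE_IN_FLIGHT */

struct sketch_req {
	uint64_t status;	/* written by the (simulated) VMM */
	bool valid;		/* cleared when the requester gave up */
	bool completed;		/* stand-in for struct completion */
	struct sketch_req *next;
};

/* Add a request at the head of the list; returns NULL on allocation failure. */
static struct sketch_req *sketch_add(struct sketch_req **head,
				     uint64_t status, bool valid)
{
	struct sketch_req *r;

	assert(head != NULL);
	r = calloc(1, sizeof(*r));
	if (!r)
		return NULL;
	r->status = status;
	r->valid = valid;
	r->next = *head;
	*head = r;
	return r;
}

/*
 * Walk the list as the callback work item would: skip in-flight
 * requests, complete valid processed ones, and free processed ones
 * whose requester already gave up. Returns the number of freed entries.
 */
static int sketch_sweep(struct sketch_req **head)
{
	struct sketch_req **link = head;
	int freed = 0;

	while (*link) {
		struct sketch_req *r = *link;

		if (r->status == SKETCH_IN_FLIGHT) {
			link = &r->next;
		} else if (r->valid) {
			r->completed = true;
			link = &r->next;
		} else {
			*link = r->next;
			free(r);
			freed++;
		}
	}
	return freed;
}
```

As in the patch, a completed entry stays on the list until its waiter
removes it; only abandoned entries are freed by the sweep itself.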

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/attest.c      | 304 ++++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/tdx.h |  45 +++++
 2 files changed, 349 insertions(+)

diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
index 24db0bad4923..a2a94622922c 100644
--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -13,16 +13,56 @@
 #include <linux/miscdevice.h>
 #include <linux/mm.h>
 #include <linux/io.h>
+#include <linux/set_memory.h>
+#include <linux/mutex.h>
 #include <asm/tdx.h>
+#include <asm/coco.h>
 #include <uapi/asm/tdx.h>
 
 #define DRIVER_NAME "tdx-attest"
 
 /* TDREPORT module call leaf ID */
 #define TDX_GET_REPORT			4
+/* GetQuote hypercall leaf ID */
+#define TDVMCALL_GET_QUOTE             0x10002
+
+/* Used for buffer allocation in GetQuote request */
+struct quote_buf {
+	/* vmapped address of kernel buffer (size is page aligned) */
+	void *vmaddr;
+	/* Number of pages */
+	int count;
+};
+
+/* List entry of quote_list */
+struct quote_entry {
+	/* Flag to check validity of the GetQuote request */
+	bool valid;
+	/* Kernel buffer to share data with VMM */
+	struct quote_buf buf;
+	/* Completion object to track completion of GetQuote request */
+	struct completion compl;
+	struct list_head list;
+};
 
 static struct miscdevice miscdev;
 
+/*
+ * To support parallel GetQuote requests, use the list
+ * to track active GetQuote requests.
+ */
+static LIST_HEAD(quote_list);
+
+/* Lock to protect quote_list */
+static DEFINE_MUTEX(quote_lock);
+
+/*
+ * Workqueue to handle Quote data after Quote generation
+ * notification from VMM.
+ */
+static struct workqueue_struct *quote_wq;
+static struct work_struct quote_work;
+
 static long tdx_get_report(void __user *argp)
 {
 	void *reportdata = NULL, *tdreport = NULL;
@@ -71,6 +111,260 @@ static long tdx_get_report(void __user *argp)
 	return ret;
 }
 
+/* tdx_get_quote_hypercall() - Request to get TD Quote using TDREPORT */
+static long tdx_get_quote_hypercall(struct quote_buf *buf)
+{
+	struct tdx_hypercall_args args = {0};
+
+	args.r10 = TDX_HYPERCALL_STANDARD;
+	args.r11 = TDVMCALL_GET_QUOTE;
+	args.r12 = cc_mkdec(page_to_phys(vmalloc_to_page(buf->vmaddr)));
+	args.r13 = buf->count * PAGE_SIZE;
+
+	/*
+	 * Pass the physical address of TDREPORT to the VMM and
+	 * trigger the Quote generation. It is not a blocking
+	 * call, hence completion of this request will be notified to
+	 * the TD guest via a callback interrupt. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), sec titled "TDG.VP.VMCALL<GetQuote>".
+	 */
+	return __tdx_hypercall(&args, 0);
+}
+
+/*
+ * init_quote_buf() - Initialize the quote buffer by allocating
+ *                    a shared buffer of given size.
+ *
+ * Size is page aligned and the allocated memory is decrypted
+ * to allow VMM to access it. Uses VMAP to create a virtual
+ * mapping, which is further used to create a shared mapping
+ * for the buffer without affecting the direct map.
+ */
+static int init_quote_buf(struct quote_buf *buf, u64 req_size)
+{
+	int size = PAGE_ALIGN(req_size);
+	void *addr = NULL, *vmaddr = NULL;
+	int count = size >> PAGE_SHIFT;
+	struct page **pages = NULL;
+	int i;
+
+	addr = alloc_pages_exact(size, GFP_KERNEL);
+	if (!addr)
+		return -ENOMEM;
+
+	/* Allocate mem for array of page ptrs */
+	pages = kcalloc(count, sizeof(*pages), GFP_KERNEL);
+	if (!pages) {
+		free_pages_exact(addr, size);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < count; i++)
+		pages[i] = virt_to_page(addr + i * PAGE_SIZE);
+
+	/*
+	 * Use VMAP to create a virtual mapping, which is used
+	 * to create shared mapping without affecting the
+	 * direct map. Use VM_MAP_PUT_PAGES to make vfree()
+	 * responsible for freeing the pages and the page array.
+	 */
+	vmaddr = vmap(pages, count, VM_MAP_PUT_PAGES, PAGE_KERNEL);
+	if (!vmaddr) {
+		kfree(pages);
+		free_pages_exact(addr, size);
+		return -EIO;
+	}
+
+	/* Use noalias variant to not affect the direct mapping */
+	if (set_memory_decrypted_noalias((unsigned long)vmaddr, count)) {
+		vfree(vmaddr);
+		return -EIO;
+	}
+
+	buf->vmaddr = vmaddr;
+	buf->count = count;
+
+	return 0;
+}
+
+/* Remove the shared mapping and free the memory */
+static void deinit_quote_buf(struct quote_buf *buf)
+{
+	if (!buf)
+		return;
+
+	/* Mark pages private */
+	if (set_memory_encrypted_noalias((unsigned long)buf->vmaddr,
+				buf->count)) {
+		pr_warn("Failed to encrypt %d pages at %p\n", buf->count,
+				buf->vmaddr);
+		return;
+	}
+
+	vfree(buf->vmaddr);
+}
+
+static struct quote_entry *alloc_quote_entry(u64 buf_len)
+{
+	struct quote_entry *entry = NULL;
+
+	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return NULL;
+
+	/* Init buffer for quote request */
+	if (init_quote_buf(&entry->buf, buf_len)) {
+		kfree(entry);
+		return NULL;
+	}
+
+	init_completion(&entry->compl);
+	entry->valid = true;
+
+	return entry;
+}
+
+static void free_quote_entry(struct quote_entry *entry)
+{
+	deinit_quote_buf(&entry->buf);
+	kfree(entry);
+}
+
+/* Must be called with quote_lock held */
+static void _del_quote_entry(struct quote_entry *entry)
+{
+	list_del(&entry->list);
+	free_quote_entry(entry);
+}
+
+static void del_quote_entry(struct quote_entry *entry)
+{
+	mutex_lock(&quote_lock);
+	_del_quote_entry(entry);
+	mutex_unlock(&quote_lock);
+}
+
+/* Handles early termination of GetQuote requests */
+static void terminate_quote_request(struct quote_entry *entry)
+{
+	struct tdx_quote_hdr *quote_hdr;
+
+	/*
+	 * For early termination, if the request is not yet
+	 * processed by VMM (GET_QUOTE_IN_FLIGHT), the VMM
+	 * still owns the shared buffer, so mark the request
+	 * invalid to let quote_callback_handler() handle the
+	 * memory cleanup function. If the request is already
+	 * processed, then do the cleanup and return.
+	 */
+
+	mutex_lock(&quote_lock);
+	quote_hdr = (struct tdx_quote_hdr *)entry->buf.vmaddr;
+	if (quote_hdr->status == GET_QUOTE_IN_FLIGHT) {
+		entry->valid = false;
+		mutex_unlock(&quote_lock);
+		return;
+	}
+	_del_quote_entry(entry);
+	mutex_unlock(&quote_lock);
+}
+
+static long tdx_get_quote(void __user *argp)
+{
+	struct quote_entry *entry;
+	struct tdx_quote_req req;
+	struct quote_buf *buf;
+	long ret;
+
+	/* Copy GetQuote request struct from user buffer */
+	if (copy_from_user(&req, argp, sizeof(struct tdx_quote_req)))
+		return -EFAULT;
+
+	/* Make sure the length is valid */
+	if (!req.len)
+		return -EINVAL;
+
+	entry = alloc_quote_entry(req.len);
+	if (!entry)
+		return -ENOMEM;
+
+	buf = &entry->buf;
+
+	/* Copy TDREPORT from user buffer to kernel Quote buffer */
+	if (copy_from_user(buf->vmaddr, (void __user *)req.buf, req.len)) {
+		free_quote_entry(entry);
+		return -EFAULT;
+	}
+
+	mutex_lock(&quote_lock);
+
+	/* Submit GetQuote Request */
+	ret = tdx_get_quote_hypercall(buf);
+	if (ret) {
+		mutex_unlock(&quote_lock);
+		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
+		free_quote_entry(entry);
+		return -EIO;
+	}
+
+	/* Add current quote entry to quote_list to track active requests */
+	list_add_tail(&entry->list, &quote_list);
+
+	mutex_unlock(&quote_lock);
+
+	/* Wait for attestation completion */
+	ret = wait_for_completion_interruptible(&entry->compl);
+	if (ret < 0) {
+		terminate_quote_request(entry);
+		return -EINTR;
+	}
+
+	/*
+	 * If GetQuote request completed successfully, copy the result
+	 * back to the user and do the cleanup.
+	 */
+	if (copy_to_user((void __user *)req.buf, buf->vmaddr, req.len))
+		ret = -EFAULT;
+
+	/*
+	 * Reaching here means the VMM has processed the GetQuote
+	 * request. Do the cleanup and return the copy status.
+	 */
+	del_quote_entry(entry);
+
+	return ret;
+}
+
+static void attestation_callback_handler(void)
+{
+	queue_work(quote_wq, &quote_work);
+}
+
+static void quote_callback_handler(struct work_struct *work)
+{
+	struct tdx_quote_hdr *quote_hdr;
+	struct quote_entry *entry, *next;
+
+	/* Find processed quote request and mark it complete */
+	mutex_lock(&quote_lock);
+	list_for_each_entry_safe(entry, next, &quote_list, list) {
+		quote_hdr = (struct tdx_quote_hdr *)entry->buf.vmaddr;
+		if (quote_hdr->status == GET_QUOTE_IN_FLIGHT)
+			continue;
+		/*
+		 * If user invalidated the current request, remove the
+		 * entry from the quote list and free it. If the request
+		 * is still valid, mark it complete.
+		 */
+		if (entry->valid)
+			complete(&entry->compl);
+		else
+			_del_quote_entry(entry);
+	}
+	mutex_unlock(&quote_lock);
+}
+
 static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
 			     unsigned long arg)
 {
@@ -81,6 +375,9 @@ static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
 	case TDX_CMD_GET_REPORT:
 		ret = tdx_get_report(argp);
 		break;
+	case TDX_CMD_GET_QUOTE:
+		ret = tdx_get_quote(argp);
+		break;
 	default:
 		pr_debug("cmd %d not supported\n", cmd);
 		break;
@@ -103,6 +400,13 @@ static int __init tdx_attestation_init(void)
 	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
 		return -EIO;
 
+	quote_wq = create_singlethread_workqueue("tdx_quote_handler");
+
+	INIT_WORK(&quote_work, quote_callback_handler);
+
+	/* Register attestation event notify handler */
+	tdx_setup_ev_notify_handler(attestation_callback_handler);
+
 	miscdev.name = DRIVER_NAME;
 	miscdev.minor = MISC_DYNAMIC_MINOR;
 	miscdev.fops = &tdx_attest_fops;
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
index 8b57dea67eab..51944dd0462f 100644
--- a/arch/x86/include/uapi/asm/tdx.h
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -39,4 +39,49 @@ struct tdx_report_req {
  */
 #define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
 
+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf         : Pass user data that includes TDREPORT as input. Upon
+ *                successful completion of IOCTL, output is copied
+ *                back to the same buffer.
+ * @len         : Length of the buffer.
+ */
+struct tdx_quote_req {
+	__u64 buf;
+	__u64 len;
+};
+
+/*
+ * TDX_CMD_GET_QUOTE - Get TD Quote from QE/QGS using GetQuote
+ *		       TDVMCALL.
+ *
+ * Returns 0 on success, -EINTR for interrupted request, and
+ * standard errno on other failures.
+ */
+#define TDX_CMD_GET_QUOTE		_IOR('T', 0x02, struct tdx_quote_req)
+
+/* TD Quote status codes */
+#define GET_QUOTE_SUCCESS		0
+#define GET_QUOTE_IN_FLIGHT		0xffffffffffffffff
+#define GET_QUOTE_ERROR			0x8000000000000000
+#define GET_QUOTE_SERVICE_UNAVAILABLE	0x8000000000000001
+
+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+	/* Quote version, filled by TD */
+	__u64 version;
+	/* Status code of Quote request, filled by VMM */
+	__u64 status;
+	/* Length of TDREPORT, filled by TD */
+	__u32 in_len;
+	/* Length of Quote, filled by VMM */
+	__u32 out_len;
+	/* Actual Quote data */
+	__u64 data[0];
+};
+
 #endif /* _UAPI_ASM_X86_TDX_H */
-- 
2.25.1



* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
@ 2022-06-14 12:30   ` Wander Lairson Costa
  2022-06-14 12:58     ` Sathyanarayanan Kuppuswamy
  2022-07-21 16:08   ` Dave Hansen
  2022-07-22  0:27   ` Dave Hansen
  2 siblings, 1 reply; 77+ messages in thread
From: Wander Lairson Costa @ 2022-06-14 12:30 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Wed, Jun 08, 2022 at 07:52:20PM -0700, Kuppuswamy Sathyanarayanan wrote:

[snip]

> +}
> +
> +/* Remove the shared mapping and free the memory */
> +static void deinit_quote_buf(struct quote_buf *buf)
> +{
> +	if (!buf)
> +		return;

nit: the null check isn't necessary anymore, is it?

[snip]



* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-06-14 12:30   ` Wander Lairson Costa
@ 2022-06-14 12:58     ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-14 12:58 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 6/14/22 5:30 AM, Wander Lairson Costa wrote:
> On Wed, Jun 08, 2022 at 07:52:20PM -0700, Kuppuswamy Sathyanarayanan wrote:
> 
> [snip]
> 
>> +}
>> +
>> +/* Remove the shared mapping and free the memory */
>> +static void deinit_quote_buf(struct quote_buf *buf)
>> +{
>> +	if (!buf)
>> +		return;
> 
> nit: the null check isn't necessary anymore, is it?

Yes. I will remove it.

> 
> [snip]
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-09  2:52 ` [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
@ 2022-06-20 12:33   ` Kai Huang
  2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kai Huang @ 2022-06-20 12:33 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
> Host-guest event notification via configured interrupt vector is useful
> in cases where a guest makes an asynchronous request and needs a
> callback from the host to indicate the completion or to let the host
> notify the guest about events like device removal. One usage example is,
> callback requirement of GetQuote asynchronous hypercall.

Although this paragraph is from the GHCI spec, IMHO it is not very helpful.  In
fact, I think this paragraph is not quite right and should be removed from the
GHCI.  The reason is that event notification from the VMM in cases like "device
removal" is too vague.  There's no _specification_ in the GHCI of which "device
removal" cases the VMM should inject such an event for.  For instance, I _think_
the Qemu-enumerated ACPI-based hotplug should continue to work in a TD.

That being said, if a TD has multiple devices, it cannot know whether the VMM
will inject the removal event via the vector set by SetupEventNotifyInterrupt.
And for the same device in the same TD, different VMMs may use different ways to
notify its removal.

It seems GetQuote is the only user of SetupEventNotifyInterrupt.  Maybe we
should just declare it is for GetQuote.

Isaku, what do you think?  Does this make sense?

> 
> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> guest to specify which interrupt vector to use as an event-notify
> vector to the VMM. Details about the SetupEventNotifyInterrupt
> hypercall can be found in TDX Guest-Host Communication Interface
> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
> Add a tdx_hcall_set_notify_intr() helper function to implement the
> SetupEventNotifyInterrupt hypercall.

As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>).
This means theoretically the TD should implement polling mode in case the VMM
doesn't support injecting the event via the vector set by
SetupEventNotifyInterrupt?

Perhaps we should update the GHCI spec to use must..

> 
> Reserve 0xec IRQ vector address for TDX guest to receive the event
> completion notification from VMM. Also add related IDT handler to
> process the notification event.

This lacks an explanation of why we need to reserve a system vector.  For
instance, why can't we use a device IRQ, which only requires one vector on one
cpu?  Reserving a system vector isn't ideal, especially for attestation, as it
is not a frequent operation.  It wastes IRQ resources, especially on server
systems with a lot of CPUs.

The reason is that the SetupEventNotifyInterrupt TDVMCALL only has one argument,
the vector, and cannot specify which CPU the VMM should inject the event to.
The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
the VMM can inject to any CPU.

Btw, x86 maintainers,

I'd like to check with you to see whether we should improve the existing
SetupEventNotifyInterrupt so we can choose to use request_irq() style for
attestation.  Using request_irq() means we don't need to reserve a system
vector, but can allocate a vector dynamically when needed.

Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
can use below way (idea only) to dynamically allocate a vector on one CPU when
attestation is needed:


	int cpu, vector;
	int irq;

	// request an IRQ, and prevent it from being migrated
	irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
	request_irq(irq, ...);
	
	// get vector, cpu from irq

	TDVMCALL<SetupEventNotifyInterrupt>(vector,
		apic->cpu_present_to_apicid(cpu));
	
Is this reasonable? If yes, is it worth doing?

-- 
Thanks,
-Kai


> 
> Add support to track the notification event status via
> /proc/interrupts.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Wander Lairson Costa <wander@redhat.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/coco/tdx/tdx.c            | 71 ++++++++++++++++++++++++++++++
>  arch/x86/include/asm/hardirq.h     |  3 ++
>  arch/x86/include/asm/idtentry.h    |  4 ++
>  arch/x86/include/asm/irq_vectors.h |  7 ++-
>  arch/x86/include/asm/tdx.h         |  4 ++
>  arch/x86/kernel/irq.c              |  7 +++
>  6 files changed, 95 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 03deb4d6920d..94542876f26a 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -11,6 +11,10 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/apic.h>
> +#include <asm/idtentry.h>
> +#include <asm/irq_regs.h>
> +#include <asm/desc.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
> @@ -19,6 +23,7 @@
>  
>  /* TDX hypercall Leaf IDs */
>  #define TDVMCALL_MAP_GPA		0x10001
> +#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
>  
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -34,6 +39,26 @@
>  #define VE_GET_PORT_NUM(e)	((e) >> 16)
>  #define VE_IS_IO_STRING(e)	((e) & BIT(4))
>  
> +/*
> + * Handler used to report notifications about
> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> + * used only by the attestation driver. So, race condition
> + * with read/write operation is not considered.
> + */
> +static void (*tdx_event_notify_handler)(void);
> +
> +/* Helper function to register tdx_event_notify_handler */
> +void tdx_setup_ev_notify_handler(void (*handler)(void))
> +{
> +	tdx_event_notify_handler = handler;
> +}
> +
> +/* Helper function to unregister tdx_event_notify_handler */
> +void tdx_remove_ev_notify_handler(void)
> +{
> +	tdx_event_notify_handler = NULL;
> +}
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -98,6 +123,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>  }
>  
> +/* TDX guest event notification handler */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
> +{
> +	struct pt_regs *old_regs = set_irq_regs(regs);
> +
> +	inc_irq_stat(irq_tdx_event_notify_count);
> +
> +	if (tdx_event_notify_handler)
> +		tdx_event_notify_handler();
> +
> +	ack_APIC_irq();
> +
> +	set_irq_regs(old_regs);
> +}
> +
> +/*
> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
> + *
> + * @vector: Vector address to be used for notification.
> + *
> + * return 0 on success or failure error number.
> + */
> +static long tdx_hcall_set_notify_intr(u8 vector)
> +{
> +	/* Minimum vector value allowed is 32 */
> +	if (vector < 32)
> +		return -EINVAL;
> +
> +	/*
> +	 * Register callback vector address with VMM. More details
> +	 * about the ABI can be found in TDX Guest-Host-Communication
> +	 * Interface (GHCI), sec titled
> +	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> +	 */
> +	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
> +		return -EIO;
> +
> +	return 0;
> +}
> +
>  static u64 get_cc_mask(void)
>  {
>  	struct tdx_module_output out;
> @@ -688,5 +753,11 @@ void __init tdx_early_init(void)
>  	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
>  	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
>  
> +	alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
> +			asm_sysvec_tdx_event_notify);
> +
> +	if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
> +		pr_warn("Setting event notification interrupt failed\n");
> +
>  	pr_info("Guest detected\n");
>  }
> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
> index 275e7fd20310..582deff56210 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -44,6 +44,9 @@ typedef struct {
>  	unsigned int irq_hv_reenlightenment_count;
>  	unsigned int hyperv_stimer0_count;
>  #endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +	unsigned int irq_tdx_event_notify_count;
> +#endif
>  } ____cacheline_aligned irq_cpustat_t;
>  
>  DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 72184b0b2219..655086dd940e 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
>  DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_asyncpf_interrupt);
>  #endif
>  
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR,	sysvec_tdx_event_notify);
> +#endif
> +
>  #undef X86_TRAP_OTHER
>  
>  #endif
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 43dcb9284208..82ac0c0a34b1 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -104,7 +104,12 @@
>  #define HYPERV_STIMER0_VECTOR		0xed
>  #endif
>  
> -#define LOCAL_TIMER_VECTOR		0xec
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +/* Vector on which TDX Guest event notification is delivered */
> +#define TDX_GUEST_EVENT_NOTIFY_VECTOR	0xec
> +#endif
> +
> +#define LOCAL_TIMER_VECTOR		0xeb
>  
>  #define NR_VECTORS			 256
>  
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 020c81a7c729..eb4db837cc44 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -67,6 +67,10 @@ void tdx_safe_halt(void);
>  
>  bool tdx_early_handle_ve(struct pt_regs *regs);
>  
> +void tdx_setup_ev_notify_handler(void (*handler)(void));
> +
> +void tdx_remove_ev_notify_handler(void);
> +
>  #else
>  
>  static inline void tdx_early_init(void) { };
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 766ffe3ba313..a96ecd866723 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
>  		seq_printf(p, "%10u ",
>  			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
>  	seq_puts(p, "  Posted-interrupt wakeup event\n");
> +#endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +	seq_printf(p, "%*s: ", prec, "TGN");
> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ",
> +			   irq_stats(j)->irq_tdx_event_notify_count);
> +	seq_puts(p, "  TDX Guest event notification\n");
>  #endif
>  	return 0;
>  }




* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-20 12:33   ` Kai Huang
@ 2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
  2022-06-23  9:46       ` Kai Huang
                         ` (3 more replies)
  0 siblings, 4 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-20 15:44 UTC (permalink / raw)
  To: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun

Hi,

+ Jun

On 6/20/22 5:33 AM, Kai Huang wrote:
> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>> Host-guest event notification via configured interrupt vector is useful
>> in cases where a guest makes an asynchronous request and needs a
>> callback from the host to indicate the completion or to let the host
>> notify the guest about events like device removal. One usage example is,
>> callback requirement of GetQuote asynchronous hypercall.
> 
> Although this paragraph is from GHCI spec, IMHO it is not very helpful.  In
> fact, I think this paragraph is not that right and should be removed from GHCI.
> The reason is such event notification from VMM in cases like "device removal" is
> too vague.  There's no _specification_ in GHCI around which "device removal"
> should VMM inject such event.  For instance, I _think_ the Qemu enumerated ACPI-
> based hotplug should continue to work in TD.

Yes. It just says that it *can* be used to signal a device removal. It is just
an example for where it can be used. But I agree that such a use case is vague.
If it makes it better, I am fine with removing it.

Copied from sec 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>:

"Example of an operation that can use the event notify is the host
VMM signaling a device removal to the TD, in response to which a TD may
unload a device driver."

> 
> That being said, if a TD has multiple devices, it cannot know whether the VMM
> will inject the removal event via the vector set by SetupEventNotifyInterrupt. 
> And for the same device in the same TD, different VMMs may use different way to
> notify its removal.

As per the current design, if it is used for device removal, I think all
registered device drivers will get the notification and each individual device
driver has to check whether it is applicable to them.

If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
the exact device or use case detail, then it can optimize the implementation.

> 
> It seems GetQuote is the only user of SetupEventNotifyInterrupt.  Maybe we
> should just declare it is for GetQuote.

Ok.

> 
> Isaku, what do you think?  Does this make sense?
> 
>>
>> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
>> guest to specify which interrupt vector to use as an event-notify
>> vector to the VMM. Details about the SetupEventNotifyInterrupt
>> hypercall can be found in TDX Guest-Host Communication Interface
>> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
>> Add a tdx_hcall_set_notify_intr() helper function to implement the
>> SetupEventNotifyInterrupt hypercall.
> 
> As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
> set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>). 
> This means theoretically TD should implement polling mode in case VMM doesn't
> support injecting event via vector done by SetupEventNotifyInterrupt?

Yes. But the GetQuote specification does not mention a polling mode
either. So I think it is just a wording confusion.

> 
> Perhaps we should update the GHCI spec to use must..

Ok.

> 
>>
>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>> completion notification from VMM. Also add related IDT handler to
>> process the notification event.
> 
> Here lacks why we need to choose to reserve a system vector.  For instance, why
> we cannot choose to use device IRQ way which only requires one vector on one

As you have explained below, as per current spec, it just expects a system
vector.

> cpu.  As you can see reserving a system vector isn't ideal especially for
> attestation as it is not a frequent operation.  It is wasteful of using IRQ

I agree that event notification is currently only used for attestation. But I
think in the future there could be other use cases for it. If the intention is just
to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
the vector address, and there is no need for a new TDVMCALL. I think the intention
here is to have a generic method for the VMM to notify the TD about some events. I am
not clear about the possible future use cases, so I cannot comment on the frequency
of its use.

Jun, any comments?



> resource especially on server systems with a lot of CPUs.

FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
reserved only for TDX use case.


> 
> The reason is SetupEventNotifyInterrupt TDVMCALL only has one argument, which is
> vector, but cannot specify which CPU that the VMM should inject the event to. 
> The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
> to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
> VMM can inject to any CPU.
> 
> Btw, x86 maintainers,
> 
> I'd like to check with you to see whether we should improve the existing
> SetupEventNotifyInterrupt so we can choose to use request_irq() style for
> attestation.  Using request_irq() means we don't need to reserve a system
> vector, but can allocate a vector dynamically when needed.
> 
> Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
> CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
> can use below way (idea only) to dynamically allocate a vector on one CPU when
> attestation is needed:
> 
> 
> 	int cpu, vector;
> 	int irq;
> 
> 	// request an IRQ, and prevent it from being migrated
> 	irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
> 	request_irq(irq, ...);
> 	
> 	// get vector, cpu from irq
> 
> 	TDVMCALL<SetupEventNotifyInterrupt>(vector, 
> 		apic->cpu_present_to_apicid(cpu));
> 	
> Is this reasonable? If yes, is it worth to do?
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
@ 2022-06-23  9:46       ` Kai Huang
  2022-06-23 10:24       ` Kai Huang
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-06-23  9:46 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun

> 
> > 
> > That being said, if a TD has multiple devices, it cannot know whether the VMM
> > will inject the removal event via the vector set by SetupEventNotifyInterrupt. 
> > And for the same device in the same TD, different VMMs may use different way to
> > notify its removal.
> 
> As per current design, If it is used for device removal, I think all registered
> device drivers will get the notification and the individual device driver has
> to check whether it is applicable for them.

The problem is there's no _specification_ around this.  As I said above, I don't
see why TDX architecture cannot support Qemu enumerated ACPI-based hotplug,
which uses SCI.  Maybe I am missing something here, but IMHO we should have some
_specification_ around SetupEventNotifyInterrupt in GHCI.

> 
> If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
> the exact device or use case detail, then it can optimize the implementation.
> 
> > 
> > It seems GetQuote is the only user of SetupEventNotifyInterrupt.  Maybe we
> > should just declare it is for GetQuote.
> 
> Ok.

If you believe this is reasonable, perhaps you can drive the spec change.

> 
> > 
> > Isaku, what do you think?  Does this make sense?
> > 
> > > 
> > > In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> > > guest to specify which interrupt vector to use as an event-notify
> > > vector to the VMM. Details about the SetupEventNotifyInterrupt
> > > hypercall can be found in TDX Guest-Host Communication Interface
> > > (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
> > > Add a tdx_hcall_set_notify_intr() helper function to implement the
> > > SetupEventNotifyInterrupt hypercall.
> > 
> > As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
> > set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>). 
> > > This means theoretically TD should implement polling mode in case VMM doesn't
> > support injecting event via vector done by SetupEventNotifyInterrupt?
> 
> > Yes. But GetQuote specification does not talk about the polling mode
> use case as well. So I think it is just a wording confusion.

I don't think it needs to.  "can" means the VMM may choose whether to inject,
not that it must, which basically implies GetQuote should support polling.

> 
> > 
> > Perhaps we should update the GHCI spec to use must..
> 
> Ok.

If you don't want to support polling, I guess you'd better improve the GHCI.

[...]

> 
> 
> > resource especially on server systems with a lot of CPUs.
> 
> FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
> reserved only for TDX use case.

This reason doesn't hold.  I think distributions basically tend to enable all
Kconfig options so one binary works on all machines, so it can be (maybe likely)
turned on even on bare-metal machines.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
  2022-06-23  9:46       ` Kai Huang
@ 2022-06-23 10:24       ` Kai Huang
  2022-06-24 22:23       ` Sathyanarayanan Kuppuswamy
  2022-06-24 23:41       ` Nakajima, Jun
  3 siblings, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-06-23 10:24 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun

On Mon, 2022-06-20 at 08:44 -0700, Sathyanarayanan Kuppuswamy wrote:
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. 
> 

We have TDVMCALL<Service> in GHCI 1.5.  It allows specifying the vector.  I
don't know whether there's any future case that still needs
SetupEventNotifyInterrupt.

> If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL. 

The TDVMCALL<Service> in GHCI 1.5 actually supports specifying the vector.  It
currently neither clearly says "the VMM should inject event to the vCPU on which
the TDVMCALL<Service> is called", nor does it allow specifying which vCPU to
inject to.  Assuming we can lobby (I am already lobbying, actually) for either
option, I think it can in fact be used in the IRQ way (like I described in my
first reply).

I once lobbied to completely remove GetQuote (and SetupEventNotifyInterrupt)
from GHCI 1.0 and move TDVMCALL<Service> from GHCI 1.5 to 1.0, and define a new
sub-service for attestation -- GetQuote will be one command then, and we can
define more attestation related commands (which is likely) when needed.  I
believe using TDVMCALL<Service> for GetQuote can actually allow kernel to use
IRQ way, but doesn't need a system vector.  But this wasn't successful.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-09  2:52 ` [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions Kuppuswamy Sathyanarayanan
@ 2022-06-24 13:19   ` Dave Hansen
  2022-06-27 15:12     ` Kirill A. Shutemov
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-06-24 13:19 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> set_memory_*crypted() functions are used to modify the "shared" page
> attribute of the given memory. Using these APIs will modify the page
> attributes of the aliased mappings (which also includes the direct
> mapping).
> 
> But such aliased mappings modification is not desirable in use cases
> like TDX guest, where the requirement is to create the shared mapping
> without touching the direct map. It is used when allocating VMM shared
> buffers using alloc_pages()/vmap()/set_memory_*crypted() API
> combinations.
> 
> So to support such use cases, add support for noalias variants of
> set_memory_*crypted() functions.

This changelog has a lot of words, but doesn't tell us much.

It basically says, we don't want to touch the direct map in use cases
where we don't want to touch the direct map.  Not helpful.

The alias processing is there for a reason.  What is it?  Why don't you
need alias processing for TDX?  Sure, you decided you don't want to
touch the direct map, but *WHY*?  Why is it normally OK to touch the
direct map, but not OK in this case?  Even better, why is it *DESIRABLE*
to leave the direct map alone?  Lastly, why is this safe?  If alias
processing was to protect us from something, why is losing that
protection OK here?


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-09  2:52 ` [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
@ 2022-06-24 16:51   ` Dave Hansen
  2022-06-27 14:50     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-06-24 16:51 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> TDREPORT can only be verified on local platform as the MAC key is bound
> to the platform. To support remote verification of the TDREPORT, TDX
> leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT locally
> and convert it to a remote verifiable Quote.
> 
> After getting the TDREPORT, the second step of the attestation process
> is to send it to the QE to generate the Quote. TDX doesn't support SGX
> inside the TD, so the QE can be deployed in the host, or in another
> legacy VM with SGX support. How to send the TDREPORT to QE and receive
> the Quote is implementation and deployment specific.

One high-level comment: this whole series is quite agnostic about what is
actually attested.  On some level, I guess it doesn't matter.  But, it
makes me a bit uneasy.  There needs to be at least *some* kind of claim
about that somewhere, preferably a sentence in the cover letter and
sentence or two here.

Second, how can someone test this code?  It appears that they need to
assemble a veritable Rube Goldberg machine.  The least we could do is
have a selftest that just calls the ioctl() and makes sure that
something halfway sane comes out of it.

> In such
> case, since REPORTDATA is a secret, using sysfs to share it is insecure
> compared to sending it via IOCTL.

Huh?  How is sysfs "insecure"?

> +	/*
> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> +	 *
> +	 * Get the TDREPORT using REPORTDATA as input. Refer to
> +	 * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
> +	 * Specification for detailed information.
> +	 */
> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> +				virt_to_phys(reportdata), 0, 0, NULL);

One of those 0's is a "Report sub type".  Those bits in the module call
are presumably because it won't *always* be zero.  But, if there's ever
a new report type, we'll need another user/kernel ABI.  That doesn't
seem great.

The TDX module spec doesn't even give a name to "sub report type 0".
That is not super helpful for deciding what it *means*.

Right now, the entire ABI is basically:

	ret = ioctl(TDX_GET_REPORT, &buffer);

That _implies_ a ton of stuff:

	1. report sub type 0
	2. a specific input length REPORTDATA
	3. a specific output length

I don't want to over-engineer this thing, but it would make a lot of
sense to me to feed the ioctl() with something like this:

struct tdx_report
{
	u8 report_sub_type;
	u64 report_input_data;
	u64 report_input_data_len;

	u64 report_output_data;
	u64 report_output_data_len;
};

That makes *everything* explicit.  It makes it utterly clear what goes
where, *AND* it makes userspace declare it.
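
As a rough illustration of the userspace side of such an interface, the
sketch below fills and sanity-checks an explicit request.  Everything in
it is hypothetical: the struct name, the helper, and the length constants
(64 bytes for REPORTDATA comes from the patch comment, 1024 bytes for the
TDREPORT is an assumption).  It is not an existing ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 64-byte REPORTDATA (per the patch), 1024-byte TDREPORT (assumption). */
#define SKETCH_REPORTDATA_LEN	64
#define SKETCH_REPORT_LEN	1024

/* User pointers are carried in u64 fields, so they must fit. */
static_assert(sizeof(uintptr_t) <= sizeof(uint64_t), "pointer must fit in u64");

/* Hypothetical layout mirroring the proposal above. */
struct tdx_report_sketch {
	uint8_t  report_sub_type;
	uint64_t report_input_data;
	uint64_t report_input_data_len;
	uint64_t report_output_data;
	uint64_t report_output_data_len;
};

/*
 * Fill a request so that userspace declares every parameter explicitly.
 * Returns 0 on success, -EINVAL if a buffer is missing or has the wrong size.
 */
static int tdx_report_sketch_init(struct tdx_report_sketch *req, uint8_t sub_type,
				  const void *in, uint64_t in_len,
				  void *out, uint64_t out_len)
{
	if (!req || !in || !out)
		return -EINVAL;
	if (in_len != SKETCH_REPORTDATA_LEN || out_len < SKETCH_REPORT_LEN)
		return -EINVAL;

	memset(req, 0, sizeof(*req));	/* also clears the padding after the u8 */
	req->report_sub_type = sub_type;
	req->report_input_data = (uint64_t)(uintptr_t)in;
	req->report_input_data_len = in_len;
	req->report_output_data = (uint64_t)(uintptr_t)out;
	req->report_output_data_len = out_len;
	return 0;
}
```

Note that a u8 followed by u64 members leaves padding in the struct, so a
real ABI would probably want an explicit reserved field there.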

But, you have:

> +/**
> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
> + *
> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
> + *		 TDREPORT. Typically it can be some nonce provided by
> + *		 attestation service, so the generated TDREPORT can be
> + *		 uniquely verified.
> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
> + *		 TDX_REPORT_LEN.
> + *
> + * Used in TDX_CMD_GET_REPORT IOCTL request.
> + */
> +struct tdx_report_req {
> +	union {
> +		__u8 reportdata[TDX_REPORTDATA_LEN];
> +		__u8 tdreport[TDX_REPORT_LEN];
> +	};
> +};
> +

and a bunch of code copying in and out of this structure:

> +static long tdx_get_report(void __user *argp)
> +{
> +	void *reportdata = NULL, *tdreport = NULL;
...
> +	/* Copy REPORTDATA from the user buffer */
> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}

But none of that code even bothers to *use* the structure!

What's with the union?  Are you really just trying to save 64 bytes of
space?  Is that worth it?
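
For reference, the trade-off can be checked with a few lines of C.  This
is only a sketch: the 64-byte REPORTDATA size comes from the patch
comment, while the 1024-byte TDREPORT size and all names below are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPORTDATA_LEN	64	/* from the patch comment */
#define SKETCH_REPORT_LEN	1024	/* assumption for this sketch */

/* Layout as posted: input and output share the same bytes. */
struct req_union_sketch {
	union {
		uint8_t reportdata[SKETCH_REPORTDATA_LEN];
		uint8_t tdreport[SKETCH_REPORT_LEN];
	};
};

/* Alternative: separate members, so the input survives the output write. */
struct req_split_sketch {
	uint8_t reportdata[SKETCH_REPORTDATA_LEN];
	uint8_t tdreport[SKETCH_REPORT_LEN];
};

/* The union saves exactly the size of the smaller member. */
static_assert(sizeof(struct req_split_sketch) - sizeof(struct req_union_sketch)
	      == SKETCH_REPORTDATA_LEN, "union saves only the REPORTDATA bytes");

/* Bytes saved by overlaying input and output. */
static size_t union_savings_sketch(void)
{
	return sizeof(struct req_split_sketch) - sizeof(struct req_union_sketch);
}
```

So under these assumed sizes the union buys 64 bytes per request, at the
cost of the input being overwritten by the output.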

How many of these "drivers" are we going to need which are thinly veiled
ioctl()s that are only TDX module call wrappers?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 0/5] Add TDX Guest Attestation support
  2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
                   ` (4 preceding siblings ...)
  2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
@ 2022-06-24 18:24 ` Dave Hansen
  2022-06-27 14:51   ` Sathyanarayanan Kuppuswamy
  5 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-06-24 18:24 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> Following patches add the attestation support to TDX guest which
> includes attestation user interface driver and related hypercall support.

This is also the place where you lay out the roadmap:

	1. Get a report
	2. Get a quote
	 2a. Interrupt support because quotes take a long time
	 2b. Actual quote module calls and ABI

Right?  That seems worth a few sentences in the cover letter.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
  2022-06-23  9:46       ` Kai Huang
  2022-06-23 10:24       ` Kai Huang
@ 2022-06-24 22:23       ` Sathyanarayanan Kuppuswamy
  2022-06-24 23:41       ` Nakajima, Jun
  3 siblings, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-24 22:23 UTC (permalink / raw)
  To: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun, Yao,
	Jiewen

+ Jiewen

Jiewen, Can you please comment on the specification related queries?

On 6/20/22 8:44 AM, Sathyanarayanan Kuppuswamy wrote:
> Hi,
> 
> + Jun
> 
> On 6/20/22 5:33 AM, Kai Huang wrote:
>> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>>> Host-guest event notification via configured interrupt vector is useful
>>> in cases where a guest makes an asynchronous request and needs a
>>> callback from the host to indicate the completion or to let the host
>>> notify the guest about events like device removal. One usage example is,
>>> callback requirement of GetQuote asynchronous hypercall.
>>
>> Although this paragraph is from GHCI spec, IMHO it is not very helpful.  In
>> fact, I think this paragraph is not that right and should be removed from GHCI.
>> The reason is such event notification from VMM in cases like "device removal" is
>> too vague.  There's no _specification_ in GHCI around which "device removal"
>> should VMM inject such event.  For instance, I _think_ the Qemu enumerated ACPI-
>> based hotplug should continue to work in TD.
> 
> Yes. It just says that it *can* be used to signal a device removal. It is just
> an example for where it can be used. But I agree that such a use case is vague.
> If it makes it better, I am fine with removing it.
> 
> Copied from sec 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>:
> 
> "Example of an operation that can use the event notify is the host
> VMM signaling a device removal to the TD, in response to which a TD may
> unload a device driver."
> 
>>
>> That being said, if a TD has multiple devices, it cannot know whether the VMM
>> will inject the removal event via the vector set by SetupEventNotifyInterrupt. 
>> And for the same device in the same TD, different VMMs may use different way to
>> notify its removal.
> 
> As per current design, If it is used for device removal, I think all registered
> device drivers will get the notification and the individual device driver has
> to check whether it is applicable for them.
> 
> If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
> the exact device or use case detail, then it can optimize the implementation.
> 
>>
>> It seems GetQuote is the only user of SetupEventNotifyInterrupt.  Maybe we
>> should just declare it is for GetQuote.
> 
> Ok.
> 
>>
>> Isaku, what do you think?  Does this make sense?
>>
>>>
>>> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
>>> guest to specify which interrupt vector to use as an event-notify
>>> vector to the VMM. Details about the SetupEventNotifyInterrupt
>>> hypercall can be found in TDX Guest-Host Communication Interface
>>> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
>>> Add a tdx_hcall_set_notify_intr() helper function to implement the
>>> SetupEventNotifyInterrupt hypercall.
>>
>> As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
>> set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>). 
>> This means theoretically TD should implement polling mode in case VMM doesn't
>> support injecting event via vector done by SetupEventNotifyInterrupt?
> 
> Yes. But GetQuote specification does not talk about the polling mode
> use case as well. So I think it is just a wording confusion.
> 
>>
>> Perhaps we should update the GHCI spec to use must..
> 
> Ok.
> 
>>
>>>
>>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>>> completion notification from VMM. Also add related IDT handler to
>>> process the notification event.
>>
>> Here lacks why we need to choose to reserve a system vector.  For instance, why
>> we cannot choose to use device IRQ way which only requires one vector on one
> 
> As you have explained below, as per current spec, it just expects a system
> vector.
> 
>> cpu.  As you can see reserving a system vector isn't ideal especially for
>> attestation as it is not a frequent operation.  It is wasteful of using IRQ
> 
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL. I think the intention
> here is to have generic method for VMM to notify TD about some events. I am not
> clear about the possible future use cases, so I cannot comment on frequency of
> its use.
> 
> Jun, any comments?
> 
> 
> 
>> resource especially on server systems with a lot of CPUs.
> 
> FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
> reserved only for TDX use case.
> 
> 
>>
>> The reason is SetupEventNotifyInterrupt TDVMCALL only has one argument, which is
>> vector, but cannot specify which CPU that the VMM should inject the event to. 
>> The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
>> to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
>> VMM can inject to any CPU.
>>
>> Btw, x86 maintainers,
>>
>> I'd like to check with you to see whether we should improve the existing
>> SetupEventNotifyInterrupt so we can choose to use request_irq() style for
>> attestation.  Using request_irq() means we don't need to reserve a system
>> vector, but can allocate a vector dynamically when needed.
>>
>> Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
>> CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
>> can use below way (idea only) to dynamically allocate a vector on one CPU when
>> attestation is needed:
>>
>>
>> 	int cpu, vector;
>> 	int irq;
>>
>> 	// request an IRQ, and prevent it from being migrated
>> 	irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
>> 	request_irq(irq, ...);
>> 	
>> 	// get vector, cpu from irq
>>
>> 	TDVMCALL<SetupEventNotifyInterrupt>(vector, 
>> 		apic->cpu_present_to_apicid(cpu));
>> 	
>> Is this reasonable? If yes, is it worth to do?
>>
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
                         ` (2 preceding siblings ...)
  2022-06-24 22:23       ` Sathyanarayanan Kuppuswamy
@ 2022-06-24 23:41       ` Nakajima, Jun
  2022-06-25  3:35         ` Yao, Jiewen
  3 siblings, 1 reply; 77+ messages in thread
From: Nakajima, Jun @ 2022-06-24 23:41 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Huang, Kai, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov, Luck,
	Tony, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, Cox, Philip, LKML,
	Yao, Jiewen

Replying to this (not the latest one) to reduce the quoting levels, adding Jiewen.


> On Jun 20, 2022, at 8:44 AM, Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> 
> Hi,
> 
> + Jun
> 
> On 6/20/22 5:33 AM, Kai Huang wrote:
>> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>>> Host-guest event notification via configured interrupt vector is useful
>>> in cases where a guest makes an asynchronous request and needs a
>>> callback from the host to indicate the completion or to let the host
>>> notify the guest about events like device removal. One usage example is,
>>> callback requirement of GetQuote asynchronous hypercall.
>> 
>> Although this paragraph is from GHCI spec, IMHO it is not very helpful.  In
>> fact, I think this paragraph is not that right and should be removed from GHCI.
>> The reason is such event notification from VMM in cases like "device removal" is
>> too vague.  There's no _specification_ in GHCI around which "device removal"
>> should VMM inject such event.  For instance, I _think_ the Qemu enumerated ACPI-
>> based hotplug should continue to work in TD.
> 
> Yes. It just says that it *can* be used to signal a device removal. It is just
> an example for where it can be used. But I agree that such a use case is vague.
> If it makes it better, I am fine with removing it.

Yes, the "device removal" is just an example, especially, "the TD OS should be designed to not use the event notification for trusted operations", based on the context of the spec.

>> 
>>> 
>>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>>> completion notification from VMM. Also add related IDT handler to
>>> process the notification event.
>> 
>> Here lacks why we need to choose to reserve a system vector.  For instance, why
>> we cannot choose to use device IRQ way which only requires one vector on one
> 
> As you have explained below, as per current spec, it just expects a system
> vector.
> 
>> cpu.  As you can see reserving a system vector isn't ideal especially for
>> attestation as it is not a frequent operation.  It is wasteful of using IRQ
> 
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL. I think the intention
> here is to have generic method for VMM to notify TD about some events. I am not
> clear about the possible future use cases, so I cannot comment on frequency of
> its use.
> 
> Jun, any comments?
> 

The GHCI spec was just not clear, and we’ll update the spec, for example:

3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt> 
...
From:

“The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor.”

To:

“The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt>, via the posted-interrupt descriptor.”

---
Jun


^ permalink raw reply	[flat|nested] 77+ messages in thread

* RE: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-24 23:41       ` Nakajima, Jun
@ 2022-06-25  3:35         ` Yao, Jiewen
  2022-06-27 11:21           ` Kai Huang
  0 siblings, 1 reply; 77+ messages in thread
From: Yao, Jiewen @ 2022-06-25  3:35 UTC (permalink / raw)
  To: Nakajima, Jun, Sathyanarayanan Kuppuswamy
  Cc: Huang, Kai, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov, Luck,
	Tony, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, Cox, Philip, LKML

Thank you, Jun.

Yes. I confirmed that we will include the change below in the GHCI.next spec.

================
3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>

From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor."

To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor."

3.13 TDG.VP.VMCALL<Service>

Table 3-39: TDG.VP.VMCALL< Service >-Input Operands

From: R14: 
"Event notification interrupt vector - (valid values 32~255) selected by TD
0: blocking action. VMM need get response then return.
1~31: Reserved. Should not be used.
32~255: Non-block action. VMM can return immediately and signal the interrupt vector when the response is ready. "

To: R14:
"Event notification interrupt vector - (valid values 32~255) selected by TD
0: blocking action. VMM need get response then return.
1~31: Reserved. Should not be used.
32~255: Non-block action. VMM can return immediately and signal the interrupt vector when the response is ready. VMM should inject interrupt vector into the TD VCPU that executed TDG.VP.VMCALL<Service>."

================
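
The R14 ranges in Table 3-39 above can be read as a small classifier.
The sketch below is only an illustration of those ranges; the enum and
function names are hypothetical and do not come from the spec or the
patch series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the R14 cases listed in Table 3-39. */
enum notify_mode_sketch {
	NOTIFY_BLOCKING,	/* 0: VMM returns only after the response is ready */
	NOTIFY_RESERVED,	/* 1~31: reserved, should not be used */
	NOTIFY_VECTOR,		/* 32~255: VMM may return at once, signals vector later */
	NOTIFY_INVALID,		/* above 255: not a valid interrupt vector */
};

/* Map an R14 input value to the behaviour the table describes. */
static enum notify_mode_sketch classify_notify_vector(uint64_t r14)
{
	if (r14 == 0)
		return NOTIFY_BLOCKING;
	if (r14 <= 31)
		return NOTIFY_RESERVED;
	if (r14 <= 255)
		return NOTIFY_VECTOR;
	return NOTIFY_INVALID;
}

/* Pick the R14 value for a non-blocking request; vector must be usable. */
static uint64_t r14_for_async_request(uint8_t vector)
{
	assert(classify_notify_vector(vector) == NOTIFY_VECTOR);
	return vector;
}
```

For example, the 0xec vector reserved in this series falls in the
non-blocking range, while passing 0 would request the blocking behaviour.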

Thank you
Yao Jiewen


> -----Original Message-----
> From: Nakajima, Jun <jun.nakajima@intel.com>
> Sent: Saturday, June 25, 2022 7:42 AM
> To: Sathyanarayanan Kuppuswamy
> <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: Huang, Kai <kai.huang@intel.com>; Thomas Gleixner <tglx@linutronix.de>;
> Ingo Molnar <mingo@redhat.com>; Borislav Petkov <bp@alien8.de>; Dave
> Hansen <dave.hansen@linux.intel.com>; x86@kernel.org; H . Peter Anvin
> <hpa@zytor.com>; Kirill A . Shutemov <kirill.shutemov@linux.intel.com>; Luck,
> Tony <tony.luck@intel.com>; Andi Kleen <ak@linux.intel.com>; Wander Lairson
> Costa <wander@redhat.com>; Isaku Yamahata <isaku.yamahata@gmail.com>;
> marcelo.cerri@canonical.com; tim.gardner@canonical.com;
> khalid.elmously@canonical.com; Cox, Philip <philip.cox@canonical.com>; LKML
> <linux-kernel@vger.kernel.org>; Yao, Jiewen <jiewen.yao@intel.com>
> Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt
> support
> 
> Replying to this (not the latest one) to reduce the quoting levels, adding Jiewen.
> 
> 
> > On Jun 20, 2022, at 8:44 AM, Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> >
> > Hi,
> >
> > + Jun
> >
> > On 6/20/22 5:33 AM, Kai Huang wrote:
> >> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> Host-guest event notification via configured interrupt vector is useful
> >>> in cases where a guest makes an asynchronous request and needs a
> >>> callback from the host to indicate the completion or to let the host
> >>> notify the guest about events like device removal. One usage example is,
> >>> callback requirement of GetQuote asynchronous hypercall.
> >>
> >> Although this paragraph is from GHCI spec, IMHO it is not very helpful.  In
> >> fact, I think this paragraph is not that right and should be removed from GHCI.
> >> The reason is such event notification from VMM in cases like "device
> >> removal" is too vague.  There's no _specification_ in GHCI around which
> >> "device removal" should VMM inject such event.  For instance, I _think_
> >> the Qemu enumerated ACPI-based hotplug should continue to work in TD.
> >
> > Yes. It just says that it *can* be used to signal a device removal. It is just
> > an example for where it can be used. But I agree that such a use case is vague.
> > If it makes it better, I am fine with removing it.
> 
> Yes, the “device removal” is just an example, especially, "the TD OS should be
> designed to not use the event notification for trusted operations”, based on the
> context of the spec.
> 
> >>
> >>>
> >>> Reserve 0xec IRQ vector address for TDX guest to receive the event
> >>> completion notification from VMM. Also add related IDT handler to
> >>> process the notification event.
> >>
> >> Here lacks why we need to choose to reserve a system vector.  For instance,
> >> why we cannot choose to use device IRQ way which only requires one vector
> >> on one
> >
> > As you have explained below, as per current spec, it just expects a system
> > vector.
> >
> >> cpu.  As you can see reserving a system vector isn't ideal especially for
> >> attestation as it is not a frequent operation.  It is wasteful of using IRQ
> >
> > I agree that event notification is currently only used for attestation. But I
> > think in future there could be other use cases for it. If the intention is just
> > to use it for attestation, then we can just modify the GetQuote TDVMCALL to
> > pass the vector address, and there is no need for new TDVMCALL. I think the
> > intention here is to have generic method for VMM to notify TD about some
> > events. I am not clear about the possible future use cases, so I cannot
> > comment on frequency of its use.
> >
> > Jun, any comments?
> >
> 
> The GHCI spec was just not clear, and we’ll update the spec, for example:
> 
> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
> ...
> From:
> 
> “The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at
> the requested-interrupt vector into the TD via the posted-interrupt descriptor. “
> 
> To:
> 
> “The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at
> the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL
> <SetupEventNotifyInterrupt>, via the posted-interrupt descriptor. “
> 
> ---
> Jun


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-25  3:35         ` Yao, Jiewen
@ 2022-06-27 11:21           ` Kai Huang
  2022-06-27 14:56             ` Sathyanarayanan Kuppuswamy
  2022-07-14  0:46             ` Sathyanarayanan Kuppuswamy
  0 siblings, 2 replies; 77+ messages in thread
From: Kai Huang @ 2022-06-27 11:21 UTC (permalink / raw)
  To: Yao, Jiewen, Nakajima, Jun, Sathyanarayanan Kuppuswamy
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML

On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
> Thank you, Jun.
> 
> Yes. I confirmed that we will include below change to GHCI.next spec.
> 
> ================
> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
> 
> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
> 
> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
> 

Hi Sathy,

With this change, I don't think we should use system vector anymore.  Instead,
we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-24 16:51   ` Dave Hansen
@ 2022-06-27 14:50     ` Sathyanarayanan Kuppuswamy
  2022-06-27 17:24       ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-27 14:50 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

Thanks for the review.

On 6/24/22 9:51 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> TDREPORT can only be verified on local platform as the MAC key is bound
>> to the platform. To support remote verification of the TDREPORT, TDX
>> leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT locally
>> and convert it to a remote verifiable Quote.
>>
>> After getting the TDREPORT, the second step of the attestation process
>> is to send it to the QE to generate the Quote. TDX doesn't support SGX
>> inside the TD, so the QE can be deployed in the host, or in another
>> legacy VM with SGX support. How to send the TDREPORT to QE and receive
>> the Quote is implementation and deployment specific.
> 
> One high-level comment: this whole series is quite agnostic about what is
> actually attested.  On some level, I guess it doesn't matter.  But, it
> makes me a bit uneasy.  There needs to be at least *some* kind of claim
> about that somewhere, preferably a sentence in the cover letter and
> sentence or two here.

We are attesting that the given TD Guest is secure and trustworthy, and
the remote server requesting attestation can safely transfer the secrets.

Are you looking for something like the above?

Or you want to talk about the details involved (like TDREPORT hash
details)?

If it is not what you are looking for, then maybe I misunderstood your
question. Can you please elaborate?

> 
> Second, how can someone test this code?  It appears that they need to
> assemble a veritable Rube Goldberg machine.  The least we could do is
> have a selftest that just calls the ioctl() and makes sure that
> something halfway sane comes out of it.

My initial submission included a test app. But I removed it later to
reduce the review load. I thought to submit the test app once feature
support patches are merged.

https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/

If you prefer, I can add it back to the next submission with the latest changes.

> 
>> In such
>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>> compared to sending it via IOCTL.
> 
> Huh?  How is sysfs "insecure"?

REPORTDATA (input) we pass to the Module call might come from attestation
service as a per session unique ID.  If it is shared via sysfs, there is
a possibility for untrusted software to read it and trigger some form of
replay attack. So in this context, sysfs is insecure compared to IOCTL
approach. You can find the related discussion in,

https://lore.kernel.org/lkml/b8eadd3079101a2cf93ee87d36dbedf93d8a2725.camel@intel.com/

> 
>> +	/*
>> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>> +	 *
>> +	 * Get the TDREPORT using REPORTDATA as input. Refer to
>> +	 * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>> +	 * Specification for detailed information.
>> +	 */
>> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>> +				virt_to_phys(reportdata), 0, 0, NULL);
> 
> One of those 0's is a "Report sub type".  Those bits in the module call
> are presumably because it won't *always* be zero.  But, if there's ever
> a new report type, we'll need another user/kernel ABI.  That doesn't
> seem great.
> 
> The TDX module spec doesn't even give a name to "sub report type 0".
> That is not super helpful for deciding what it *means*.
> 
> Right now, the entire ABI is basically:
> 
> 	ret = ioctl(TDX_GET_REPORT, &buffer);
> 
> That _implies_ a ton of stuff:
> 
> 	1. report sub type 0
> 	2. a specific input length REPORTDATA
> 	3. a specific output length
> 
> I don't want to over-engineer this thing, but it would make a lot of
> sense to me to feed the ioctl() with something like this:
> 
> struct tdx_report
> {
> 	u8 report_sub_type;
> 	u64 report_input_data;
> 	u64 report_input_data_len;
> 
> 	u64 report_output_data;
> 	u64 report_output_data_len;
> }
> 
> That makes *everything* explicit.  It makes it utterly clear what goes
> where, *AND* it makes userspace declare it.

Ok. Parameter to handle subtype makes sense. But regarding the input and
output length, currently there is no indication in specification that it
will change and there is no way to pass it in the module call. So do you
still prefer to add parameters for it?

> 
> But, you have:
> 
>> +/**
>> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
>> + *
>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>> + *		 TDREPORT. Typically it can be some nonce provided by
>> + *		 attestation service, so the generated TDREPORT can be
>> + *		 uniquely verified.
>> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>> + *		 TDX_REPORT_LEN.
>> + *
>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>> + */
>> +struct tdx_report_req {
>> +	union {
>> +		__u8 reportdata[TDX_REPORTDATA_LEN];
>> +		__u8 tdreport[TDX_REPORT_LEN];
>> +	};
>> +};
>> +
> 
> and a bunch of code copying in and out of this structure:
> 
>> +static long tdx_get_report(void __user *argp)
>> +{
>> +	void *reportdata = NULL, *tdreport = NULL;
> ...
>> +	/* Copy REPORTDATA from the user buffer */
>> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
>> +		ret = -EFAULT;
>> +		goto out;
>> +	}
> 
> But none of that code even bothers to *use* the structure!

I have created that struct mainly to document the ABI details for the
userspace clarity (like input and output). Since the length of the input
and output is defined as macro, it worked fine for driver implementation.
I am fine with modifying the code to use it. Please let me know if you
prefer it.

> 
> What's with the union?  Are you really just trying to save 64 bytes of
> space?  Is that worth it?

Makes sense. I will change it to separate variables.

> 
> How many of these "drivers" are we going to need which are thinly veiled
> ioctl()s that are only TDX module call wrappers?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 0/5] Add TDX Guest Attestation support
  2022-06-24 18:24 ` [PATCH v8 0/5] Add TDX Guest Attestation support Dave Hansen
@ 2022-06-27 14:51   ` Sathyanarayanan Kuppuswamy
  2022-06-27 18:51     ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-27 14:51 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi,

On 6/24/22 11:24 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> Following patches add the attestation support to TDX guest which
>> includes attestation user interface driver and related hypercall support.
> 
> This is also the place where you lay out the roadmap:
> 
> 	1. Get a report
> 	2. Get a quote
> 	 2a. Interrupt support because quotes take a long time
> 	 2b. Actual quote module calls and ABI
> 
> Right?  That seems worth a few sentences in the cover letter.

Ok. I will update the cover letter with brief introduction to
changes involved.

How about following?

In TDX guest, attestation process generally involves the following steps:

1. Get the TDREPORT using user specified REPORTDATA. This is implemented
   using TDG.MR.REPORT Module call. An IOCTL interface is added to let
   userspace get the TDREPORT data (implemented in patch #1).
   
2. Using the TDREPORT data, generate a remotely verifiable signed Quote.
   Quote can be generated either using GetQuote hypercall or by communicating
   with VMM/Quoting Enclave(QE) using VSOCK. In this patch set, only the
   GetQuote hypercall model is supported. Since Quote generation is an
   asynchronous request, and takes more time, we let VMM notify the TDX Guest
   using the callback interrupt. Patches #2-5 implement Quote generation support,
   of which patch #2 implements the callback interrupt support.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-27 11:21           ` Kai Huang
@ 2022-06-27 14:56             ` Sathyanarayanan Kuppuswamy
  2022-07-14  0:46             ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-27 14:56 UTC (permalink / raw)
  To: Kai Huang, Yao, Jiewen, Nakajima, Jun
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML



On 6/27/22 4:21 AM, Kai Huang wrote:
> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>> Thank you, Jun.
>>
>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>
>> ================
>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>
>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>
>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>
> 
> Hi Sathy,
> 
> With this change, I don't think we should use system vector anymore.  Instead,
> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
> 

Yes. I will try your suggestion and make necessary changes in next version.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-24 13:19   ` Dave Hansen
@ 2022-06-27 15:12     ` Kirill A. Shutemov
  2022-06-27 18:24       ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Kirill A. Shutemov @ 2022-06-27 15:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Kai Huang, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On Fri, Jun 24, 2022 at 06:19:23AM -0700, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> > set_memory_*crypted() functions are used to modify the "shared" page
> > attribute of the given memory. Using these APIs will modify the page
> > attributes of the aliased mappings (which also includes the direct
> > mapping).
> > 
> > But such aliased mappings modification is not desirable in use cases
> > like TDX guest, where the requirement is to create the shared mapping
> > without touching the direct map. It is used when allocating VMM shared
> > buffers using alloc_pages()/vmap()/set_memory_*crypted() API
> > combinations.
> > 
> > So to support such use cases, add support for noalias variants of
> > set_memory_*crypted() functions.
> 
> This changelog has a lot of words, but doesn't tell us much.
> 
> It basically says, we don't want to touch the direct map in use cases
> where we don't want to touch the direct map.  Not helpful.
> 
> The alias processing is there for a reason.  What is it?  Why don't you
> need alias processing for TDX?  Sure, you decided you don't want to
> touch the direct map, but *WHY*?  Why is it normally OK to touch the
> direct map, but not OK in this case?  Even better, why is it *DESIRABLE*
> to leave the direct map alone?  Lastly, why is this safe?  If alias
> processing was to protect us from something, why is losing that
> protection OK here?

The whole idea of alloc_pages()/vmap()/set_memory_decrypted() exercise is
to avoid direct map fragmentation.

Alias processing modifies direct mapping and highmap in addition to the
mapping user requested the modification for. Alias processing is required
to keep different mappings in sync, but here we want to avoid this because
it leads to direct mapping fragmentation.

Normally, direct mapping modifications are done for long-lasting
allocations where fragmentation is a lesser concern. Here we have transient
allocations which may repeat over and over, trashing the direct mapping.

Speaking of safety, I initially wanted to claim that it is safe, as only the
owner of the allocation will touch the memory, and it only does so via the
vmap mapping.

*BUT* 

It made me think about my recent story with load_unaligned_zeropad().
If we leave the page in the direct mapping mapped as private and
load_unaligned_zeropad() rolls off into it, we will get a SEPT violation
that will terminate the TD, as the page is considered unaccepted.
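
As a rough illustration of when such a word-sized load rolls off into the
following page, here is a userspace sketch of the arithmetic only. It is not
the kernel helper itself; the 4 KiB page size, the 8-byte word and the names
are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4 KiB pages */
#define SKETCH_WORD_SIZE 8u	/* assumed 64-bit word */

/* True if a word-sized load starting at addr touches the following page. */
static bool word_load_crosses_page(uintptr_t addr)
{
	uintptr_t offset = addr % SKETCH_PAGE_SIZE;

	return offset + SKETCH_WORD_SIZE > SKETCH_PAGE_SIZE;
}
```

Any load that starts within the last seven bytes of a page reaches into its
neighbour, so the state of that neighbour in the direct mapping matters.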

I think we must keep aliases in sync. And vmap() doesn't make much sense
in this case :/

I urge you folks to consider DMA API again. Or have any other way to tap
into swiotlb pool.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-27 14:50     ` Sathyanarayanan Kuppuswamy
@ 2022-06-27 17:24       ` Dave Hansen
  2022-06-30 23:50         ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-06-27 17:24 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
>> One high-level comment: this whole series is quite agnostic about what is
>> actually attested.  On some level, I guess it doesn't matter.  But, it
>> makes me a bit uneasy.  There needs to be at least *some* kind of claim
>> about that somewhere, preferably a sentence in the cover letter and
>> sentence or two here.
> 
> We are attesting that the given TD Guest is secure and trustworthy, and
> the remote server requesting attestation can safely transfer the secrets.

What is *actually* attested?

> Are you looking for something like the above?

No, I'd like some actual facts, please.  I want key examples of things
visible to an end user that might change that might determine if a guest
is "secure and trustworthy".

Does a kernel command-line parameter change attestation?  How about the
disk image?  The kernel image?  initrd?  Memory *size*?  Devices?

Compare this mechanism to something that exists outside the TDX world.

>> Second, how can someone test this code?  It appears that they need to
>> assemble a veritable Rube Goldberg machine.  The least we could do is
>> have a selftest that just calls the ioctl() and makes sure that
>> something halfway sane comes out of it.
> 
> My initial submission included a test app. But I removed it later to
> reduce the review load. I thought to submit the test app once feature
> support patches are merged.
> 
> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
> 
> If you prefer, I can add it back to the next submission with the latest changes.

I doubt anyone will ever run a "test app".  Why not just make this a
selftest?

>>> In such
>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>> compared to sending it via IOCTL.
>>
>> Huh?  How is sysfs "insecure"?
> 
> REPORTDATA (input) we pass to the Module call might come from attestation
> service as a per session unique ID.  If it is shared via sysfs, there is
> a possibility for untrusted software to read it and trigger some form of
> replay attack. So in this context, sysfs is insecure compared to IOCTL
> approach. You can find the related discussion in,
> 
> https://lore.kernel.org/lkml/b8eadd3079101a2cf93ee87d36dbedf93d8a2725.camel@intel.com/

I still don't get it.

How is a 400 sysfs file "insecure"?  This sounds like "if the filesystem
permissions are broken" paranoia.  In other words, you're protecting
against a malicious root user.

In other words, I don't think the ABI here has been well thought out.

Also, this is basically a:

	Inputs:

		* nonce
		* report type

	Outputs:

		* report

I have to wonder how hard it would be to have this be:

	fd = open("/dev/foo");
	ioctl(fd, REPORT, type, flags??);
	write(fd, &inputs, inputs_len);
	read(fd,  &output, outputs_len);



>>> +	/*
>>> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>>> +	 *
>>> +	 * Get the TDREPORT using REPORTDATA as input. Refer to
>>> +	 * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>>> +	 * Specification for detailed information.
>>> +	 */
>>> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>>> +				virt_to_phys(reportdata), 0, 0, NULL);
>>
>> One of those 0's is a "Report sub type".  Those bits in the module call
>> are presumably because it won't *always* be zero.  But, if there's ever
>> a new report type, we'll need another user/kernel ABI.  That doesn't
>> seem great.
>>
>> The TDX module spec doesn't even give a name to "sub report type 0".
>> That is not super helpful for deciding what it *means*.
>>
>> Right now, the entire ABI is basically:
>>
>> 	ret = ioctl(TDX_GET_REPORT, &buffer);
>>
>> That _implies_ a ton of stuff:
>>
>> 	1. report sub type 0
>> 	2. a specific input length REPORTDATA
>> 	3. a specific output length
>>
>> I don't want to over-engineer this thing, but it would make a lot of
>> sense to me to feed the ioctl() with something like this:
>>
>> struct tdx_report
>> {
>> 	u8 report_sub_type;
>> 	u64 report_input_data;
>> 	u64 report_input_data_len;
>>
>> 	u64 report_output_data;
>> 	u64 report_output_data_len;
>> }
>>
>> That makes *everything* explicit.  It makes it utterly clear what goes
>> where, *AND* it makes userspace declare it.
> 
> Ok. Parameter to handle subtype makes sense. But regarding the input and
> output length, currently there is no indication in specification that it
> will change and there is no way to pass it in the module call. So do you
> still prefer to add parameters for it?

It seems like a more flexible ABI to me.  What you have here seems to be
a *PURE* passthrough of the module ABI.
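
A minimal userspace sketch of what such an explicit request could look like,
modelled on the struct proposed above. The names, the explicit padding field,
the 1024-byte TDREPORT size and the validation rules are assumptions for
illustration, not the driver's actual ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_REPORTDATA_LEN 64u	/* 64-byte REPORTDATA, as in the kernel-doc */
#define SKETCH_TDREPORT_LEN   1024u	/* assumed TDREPORT size */

/* Hypothetical layout: every input and output is declared by userspace. */
struct tdx_report_sketch {
	uint8_t  report_sub_type;
	uint8_t  reserved[7];		/* explicit padding, must be zero */
	uint64_t report_input_data;	/* user pointer to REPORTDATA */
	uint64_t report_input_data_len;
	uint64_t report_output_data;	/* user pointer to the TDREPORT buffer */
	uint64_t report_output_data_len;
};

static_assert(sizeof(struct tdx_report_sketch) == 40, "no implicit padding");

/* Returns 0 if the request is acceptable, -EINVAL otherwise. */
static int tdx_report_sketch_validate(const struct tdx_report_sketch *req)
{
	for (unsigned int i = 0; i < sizeof(req->reserved); i++)
		if (req->reserved[i])
			return -EINVAL;
	/* Only sub type 0 is known today; reject the rest so they can be added later. */
	if (req->report_sub_type != 0)
		return -EINVAL;
	if (req->report_input_data_len != SKETCH_REPORTDATA_LEN)
		return -EINVAL;
	if (req->report_output_data_len < SKETCH_TDREPORT_LEN)
		return -EINVAL;
	if (!req->report_input_data || !req->report_output_data)
		return -EINVAL;
	return 0;
}
```

Rejecting unknown sub types and non-zero reserved bytes up front is what would
let a later report type reuse the same ioctl without a new user/kernel ABI.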

>> But, you have:
>>
>>> +/**
>>> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
>>> + *
>>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>>> + *		 TDREPORT. Typically it can be some nonce provided by
>>> + *		 attestation service, so the generated TDREPORT can be
>>> + *		 uniquely verified.
>>> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>>> + *		 TDX_REPORT_LEN.
>>> + *
>>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>>> + */
>>> +struct tdx_report_req {
>>> +	union {
>>> +		__u8 reportdata[TDX_REPORTDATA_LEN];
>>> +		__u8 tdreport[TDX_REPORT_LEN];
>>> +	};
>>> +};
>>> +
>>
>> and a bunch of code copying in and out of this structure:
>>
>>> +static long tdx_get_report(void __user *argp)
>>> +{
>>> +	void *reportdata = NULL, *tdreport = NULL;
>> ...
>>> +	/* Copy REPORTDATA from the user buffer */
>>> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
>>> +		ret = -EFAULT;
>>> +		goto out;
>>> +	}
>>
>> But none of that code even bothers to *use* the structure!
> 
> I have created that struct mainly to document the ABI details for the
> userspace clarity (like input and output). Since the length of the input
> and output is defined as macro, it worked fine for driver implementation.
> I am fine with modifying the code to use it. Please let me know if you
> prefer it.

Let's say I'm reviewing kernel code.  I want to know what the ABI of
'reportdata' is.  How do I find it?

Now, imagine that the code is:


	struct tdx_report_req __user *user_rd = argp;
	struct tdx_report_req *kern_rd;
	...

	if (copy_from_user(&kern_rd->reportdata,
			   &user_rd->reportdata,
			   sizeof(kern_rd->reportdata))) {


This tells me a *LOT*.  It says that the user and kernel buffers are the
same ABI, and I'm also much more confident that the copy matches the data
structure sizes.

We don't put structures in kernel headers just for our health.  We put
them so the kernel can use them.
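
The same idea can be tried out in userspace. The sketch below uses separate
fields instead of the union and sizes the copy from the destination field.
copy_sketch() is a plain memcpy() stand-in for copy_from_user(), and all names
and the 1024-byte TDREPORT size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_REPORTDATA_LEN 64u	/* 64-byte REPORTDATA, as in the kernel-doc */
#define SKETCH_TDREPORT_LEN   1024u	/* assumed TDREPORT size */

/* Separate input and output fields, no union. */
struct report_req_sketch {
	uint8_t reportdata[SKETCH_REPORTDATA_LEN];
	uint8_t tdreport[SKETCH_TDREPORT_LEN];
};

static_assert(offsetof(struct report_req_sketch, tdreport) ==
	      SKETCH_REPORTDATA_LEN, "fields must not overlap");

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied. */
static unsigned long copy_sketch(void *dst, const void *src, unsigned long len)
{
	memcpy(dst, src, len);
	return 0;
}

/* Copy only the input field; the length comes from the structure itself. */
static int fetch_reportdata(struct report_req_sketch *kern,
			    const struct report_req_sketch *user)
{
	if (copy_sketch(kern->reportdata, user->reportdata,
			sizeof(kern->reportdata)))
		return -1;
	return 0;
}
```

Because the length is sizeof() of the field being written, changing the array
size in the header changes the copy with it.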

>> What's with the union?  Are you really just trying to save 64 bytes of
>> space?  Is that worth it?
> 
> Makes sense. I will change it to separate variables.
> 
>> How many of these "drivers" are we going to need which are thinly veiled
>> ioctl()s that are only TDX module call wrappers?

This is actually a really big question.  This code is obviously just
trying to do one thing very narrowly and not thinking about the future
at all.  Can we please consider what the future will be like and try to
*architect* this instead of having the kernel just play a glorified game
of telephone?


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-27 15:12     ` Kirill A. Shutemov
@ 2022-06-27 18:24       ` Dave Hansen
  2022-06-28  1:15         ` Kai Huang
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-06-27 18:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Kai Huang, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On 6/27/22 08:12, Kirill A. Shutemov wrote:
> It made me think about my recent story with load_unaligned_zeropad().
> If we leave the page in the direct mapping mapped as private and
> load_unaligned_zeropad() rolls off into it, we will get a SEPT violation
> that will terminate the TD, as the page is considered unaccepted.
> 
> I think we must keep aliases in sync. And vmap() doesn't make much sense
> in this case :/
> 
> I urge you folks to consider DMA API again. Or have any other way to tap
> into swiotlb pool.

Ugh.  This is a good point.  We need *PHYSICAL* pages to pad the front
of any page we use for the quotes.  That means some crazy code like:

	struct page *pages[nr_pages];
	struct page *pages_vmap[nr_pages];

	for (i = 0; i < nr_pages; i++) {
		// allocate an extra "padding" page:
		pages[i] = alloc_pages(GFP_WHATEVER, 1);
						     ^ note the order=1
		// record the page that will be vmap()'d:
		pages_vmap[i] = pages[i]+1;
		set_pages_decrypted(page_to_virt(pages_vmap[i]));
	}

	vmap(pages_vmap, nr_pages);

That's just adorable.  The other way is to do alloc_pages_exact() with
*one* extra page and just use contiguous memory for it.
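
A userspace sketch of that second layout, with one leading padding page in
front of the usable pages. aligned_alloc() stands in for the kernel allocator
here, so this only models the layout and says nothing about physical
contiguity; the names and the 4 KiB page size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4 KiB pages */

struct padded_buf {
	void *base;		/* start of the allocation: the padding page */
	void *usable;		/* first page that would be made shared */
	size_t nr_pages;	/* number of usable pages after the padding */
};

/* Allocate nr_pages usable pages plus one padding page in front. */
static int padded_buf_alloc(struct padded_buf *buf, size_t nr_pages)
{
	if (nr_pages == 0 || nr_pages > SIZE_MAX / SKETCH_PAGE_SIZE - 1)
		return -1;

	buf->base = aligned_alloc(SKETCH_PAGE_SIZE,
				  (nr_pages + 1) * SKETCH_PAGE_SIZE);
	if (!buf->base)
		return -1;

	buf->usable = (char *)buf->base + SKETCH_PAGE_SIZE;
	buf->nr_pages = nr_pages;
	return 0;
}

static void padded_buf_free(struct padded_buf *buf)
{
	free(buf->base);
	buf->base = buf->usable = NULL;
	buf->nr_pages = 0;
}
```

Only the pages starting at `usable` would be converted; the padding page stays
private and is never handed out.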

I still don't like the idea of using the DMA API itself.  But, maybe we
need some common infrastructure that the DMA API and this code use which
says, "get me some pages that I can safely make shared".

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 0/5] Add TDX Guest Attestation support
  2022-06-27 14:51   ` Sathyanarayanan Kuppuswamy
@ 2022-06-27 18:51     ` Dave Hansen
  0 siblings, 0 replies; 77+ messages in thread
From: Dave Hansen @ 2022-06-27 18:51 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/27/22 07:51, Sathyanarayanan Kuppuswamy wrote:
> In TDX guest, attestation process generally involves the following steps:
> 
> 1. Get the TDREPORT using user specified REPORTDATA. This is implemented
>    using TDG.MR.REPORT Module call. An IOCTL interface is added to let
>    userspace get the TDREPORT data (implemented in patch #1).
>    
> 2. Using the TDREPORT data, generate a remotely verifiable signed Quote.
>    Quote can be generated either using GetQuote hypercall or by communicating
>    with VMM/Quoting Enclave(QE) using VSOCK. In this patch set, only the
>    GetQuote hypercall model is supported. Since Quote generation is an
>    asynchronous request, and takes more time, we let VMM notify the TDX Guest
>    using the callback interrupt. Patches #2-5 implement Quote generation support,
>    of which patch #2 implements the callback interrupt support.

IMNHO, too much gibberish, not enough English, too much superfluous
information.

For instance, why do we need a quote and a report?  Why does this have
an interrupt?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-27 18:24       ` Dave Hansen
@ 2022-06-28  1:15         ` Kai Huang
  2022-07-05 15:29           ` Kirill A. Shutemov
  0 siblings, 1 reply; 77+ messages in thread
From: Kai Huang @ 2022-06-28  1:15 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel


> 
> I still don't like the idea of using the DMA API itself.  But, maybe we
> need some common infrastructure that the DMA API and this code use which
> says, "get me some pages that I can safely make shared".

Right.  For instance any KVM PV feature would require shared memory, and DMA API
normally doesn't fit (taking 'struct kvm_steal_time' as example).

Maybe we can reserve a CMA for this purpose.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-27 17:24       ` Dave Hansen
@ 2022-06-30 23:50         ` Sathyanarayanan Kuppuswamy
  2022-07-05 12:07           ` Kai Huang
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-06-30 23:50 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi,

On 6/27/22 10:24 AM, Dave Hansen wrote:
> On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:

>>> Second, how can someone test this code?  It appears that they need to
>>> assemble a veritable Rube Goldberg machine.  The least we could do is
>>> have a selftest that just calls the ioctl() and makes sure that
>>> something halfway sane comes out of it.
>>
>> My initial submission included a test app. But I removed it later to
>> reduce the review load. I thought to submit the test app once feature
>> support patches are merged.
>>
>> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
>>
>> If you prefer, I can add it back to the next submission with the latest changes.
> 
> I doubt anyone will ever run a "test app".  Why not just make this a
> selftest?

Fine with me. I can change it into selftest.

> 
>>>> In such
>>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>>> compared to sending it via IOCTL.
>>>
>>> Huh?  How is sysfs "insecure"?
>>
>> REPORTDATA (input) we pass to the Module call might come from attestation
>> service as a per session unique ID.  If it is shared via sysfs, there is
>> a possibility for untrusted software to read it and trigger some form of
>> replay attack. So in this context, sysfs is insecure compared to IOCTL
>> approach. You can find the related discussion in,
>>
>> https://lore.kernel.org/lkml/b8eadd3079101a2cf93ee87d36dbedf93d8a2725.camel@intel.com/
> 
> I still don't get it.
> 
> How is a 400 sysfs file "insecure"?  This sounds "if the filesystem
> permissions are broken" paranoia.  In other words, you're protecting
> against a malicious root user.

AFAIK, root is not the only user of the attestation interface. General users can
also use it (for example, in a use case like PyTorch, the server may request
attestation before passing the training data). So if the permission is not "root
only", then other users or applications in the TD can access the sysfs file and
read its contents.

Also, the security concern mentioned is just an additional point. The following
are the main reasons for choosing IOCTL over sysfs.

1. Sysfs is generally used for config purposes, not for request/response kinds of
   use cases (attestation is a request/response use case).
2. If we only use sysfs model for GetReport, the design might look like below: 

    /sys/kernel/../report/input
    /sys/kernel/../report/subtype
    /sys/kernel/../report/input_len
    /sys/kernel/../report/output
    /sys/kernel/../report/output_len
    /sys/kernel/../report/status
    /sys/kernel/../report/trigger

We might need similar files for GetQuote use case as well. IMO, such a design is 
more complicated than using IOCTL.
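
For comparison with the file list above, here is a minimal userspace sketch of
how one IOCTL request could carry the same information in a single structure.
The structure name, its field names and the two length constants are
assumptions made for illustration; they are not taken from the patch under
review.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes: 64-byte REPORTDATA input, 1024-byte TDREPORT output. */
#define TDX_REPORTDATA_LEN 64
#define TDX_REPORT_LEN     1024

/* Hypothetical request layout; every name here is illustrative only. */
struct tdx_report_req_sketch {
        uint8_t  subtype;      /* report subtype, 0 in this sketch */
        uint64_t reportdata;   /* user pointer to the input buffer */
        uint32_t rpd_len;      /* length of the input buffer */
        uint64_t tdreport;     /* user pointer to the output buffer */
        uint32_t tdr_len;      /* length of the output buffer */
};

/*
 * Describe both the input and the output buffer in one request.
 * Returns 0 on success, -EINVAL if a buffer is missing or mis-sized.
 */
static int tdx_fill_report_req(struct tdx_report_req_sketch *req,
                               const void *reportdata, size_t rpd_len,
                               void *tdreport, size_t tdr_len)
{
        if (!req || !reportdata || !tdreport)
                return -EINVAL;
        if (rpd_len != TDX_REPORTDATA_LEN || tdr_len != TDX_REPORT_LEN)
                return -EINVAL;

        memset(req, 0, sizeof(*req));
        req->subtype = 0;
        req->reportdata = (uint64_t)(uintptr_t)reportdata;
        req->rpd_len = (uint32_t)rpd_len;
        req->tdreport = (uint64_t)(uintptr_t)tdreport;
        req->tdr_len = (uint32_t)tdr_len;
        return 0;
}
```

With such a layout the caller would issue one ioctl() on the filled request,
and the return value of that call would play the role of the separate
"status" and "trigger" files.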

> 
> In other words, I don't think the ABI here has been well thought out.
> 
> Also, this is basically a:
> 
> 	Inputs:
> 
> 		* nonce
> 		* report type
> 
> 	Outputs:
> 
> 		* report
> 
> I have to wonder how hard it would be to have this be:
> 
> 	fd = open("/dev/foo");
> 	ioctl(fd, REPORT, type, flags??);
> 	write(fd, &inputs, inputs_len);
> 	read(fd,  &output, outputs_len);
> 

It is not hard to implement it this way. But I don't see it being
very different from just using IOCTL. config/{read/write} model is
usually preferred for applications in which you have a requirement to do
multiple read/writes after one time config (use cases like serial
port read, printer write or reading temperature, etc). But in our case
we will mostly use it once after every config. 

Also, splitting input over IOCTL and write system call will require us
to add additional code to store the state.
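
To make the "store the state" point concrete, here is a small userspace model
(not kernel code) of the per-open-file bookkeeping that a split
ioctl()/write()/read() interface would need. All names are hypothetical, the
64-byte REPORTDATA size is an assumption, and the TDCALL is replaced by a
stand-in that simply echoes the input.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPORTDATA_LEN 64   /* assumed REPORTDATA size */

enum req_state { REQ_IDLE, REQ_CONFIGURED, REQ_INPUT_READY };

/* Models what would hang off file->private_data in the split design. */
struct attest_state {
        enum req_state state;
        unsigned int subtype;
        uint8_t reportdata[REPORTDATA_LEN];
};

/* ioctl() step: record the request type and drop any earlier input. */
static int model_ioctl(struct attest_state *s, unsigned int subtype)
{
        memset(s, 0, sizeof(*s));
        s->subtype = subtype;
        s->state = REQ_CONFIGURED;
        return 0;
}

/* write() step: only valid after ioctl(), and only with a full input. */
static long model_write(struct attest_state *s, const void *buf, size_t len)
{
        if (s->state == REQ_IDLE || len != REPORTDATA_LEN)
                return -EINVAL;
        memcpy(s->reportdata, buf, len);
        s->state = REQ_INPUT_READY;
        return (long)len;
}

/* read() step: only valid once the input has been written. */
static long model_read(struct attest_state *s, void *buf, size_t len)
{
        if (s->state != REQ_INPUT_READY || len < REPORTDATA_LEN)
                return -EINVAL;
        /* Stand-in for the TDCALL: echo the stored input back. */
        memcpy(buf, s->reportdata, REPORTDATA_LEN);
        s->state = REQ_CONFIGURED;   /* the input has been consumed */
        return REPORTDATA_LEN;
}
```

Each handler has to check which step came before it, which is the extra code
a single IOCTL avoids.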

I have attempted a sample implementation (untested) for reference. But I
feel our current code is simpler. It handles allocation and input/output
validation in one place compared to spreading it across multiple handlers. 

struct report_req {
        int subtype;
        void *reportdata;
        int reportdata_len;
};

struct tdx_attest_req {
        unsigned int cmd;
        void *data;
};


static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
                             unsigned long arg)
{
        void __user *argp = (void __user *)arg;
        struct tdx_attest_req *areq = file->private_data;
        struct report_req *rep_data = NULL;
        struct tdx_report_req user_req;
        long ret = -EINVAL;

        switch (cmd) {
        case TDX_CMD_GET_REPORT:
                /* Allocate zeroed space so that ->reportdata starts as NULL */
                rep_data = kzalloc(sizeof(*rep_data), GFP_KERNEL);
                if (!rep_data)
                        return -ENOMEM;

                /* Copy user request data */
                if (copy_from_user(&user_req, argp, sizeof(user_req))) {
                        kfree(rep_data);
                        return -EFAULT;
                }

                /* Copy user request details to areq->data */
                rep_data->subtype = user_req.subtype;
                areq->cmd = cmd;
                areq->data = rep_data;

                ret = 0;
                break;
        default:
                pr_debug("cmd %d not supported\n", cmd);
                break;
        }

        return ret;
}

static ssize_t tdx_attest_read(struct file *filp, char __user *buffer,
                                size_t length, loff_t *offset)
{
        struct tdx_attest_req *areq = filp->private_data;
        struct report_req *rep_data;
        void *tdreport;
        long ret;

        if (!areq)
                return -EINVAL;

        switch (areq->cmd) {
        case TDX_CMD_GET_REPORT:

                /* Check for valid length and file offset */
                if (length != TDX_REPORT_LEN || *offset != 0)
                        return -EINVAL;

                rep_data = areq->data;

                /* Make sure we have valid reportdata */
                if (!rep_data->reportdata)
                        return -EINVAL;

                /* Allocate space for output data */
                tdreport = kzalloc(length, GFP_KERNEL);
                if (!tdreport)
                        return -ENOMEM;
                /*
                 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
                 *
                 * Get the TDREPORT using REPORTDATA as input. Refer to
                 * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
                 * Specification for detailed information.
                 */
                ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
                                virt_to_phys(rep_data->reportdata), 0, 0, NULL);
                if (ret) {
                        pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
                        kfree(tdreport);
                        return -EIO;
                }

                /* Copy TDREPORT to the user buffer */
                if (copy_to_user(buffer, tdreport, length)) {
                        kfree(tdreport);
                        return -EFAULT;
                }

                /* The kernel copy is no longer needed */
                kfree(tdreport);

                return length;
        default:
                pr_debug("cmd %d not supported\n", areq->cmd);
                break;
        }

        return 0;
}

static ssize_t tdx_attest_write(struct file *filp, const char __user *buffer,
                                size_t length, loff_t *offset)
{
        struct tdx_attest_req *areq = filp->private_data;
        struct report_req *rep_data;

        if (!areq)
                return -EINVAL;

        switch (areq->cmd) {
        case TDX_CMD_GET_REPORT:

                /* Check for valid length and file offset */
                if (length != TDX_REPORTDATA_LEN || *offset != 0)
                        return -EINVAL;

                rep_data = areq->data;

                /* Allocate space to store REPORTDATA */
                rep_data->reportdata = kzalloc(length, GFP_KERNEL);
                if (!rep_data->reportdata)
                        return -ENOMEM;

                /* Copy REPORTDATA from the user buffer */
                if (copy_from_user(rep_data->reportdata, buffer, length)) {
                        kfree(rep_data->reportdata);
                        rep_data->reportdata = NULL;
                        return -EFAULT;
                }

                rep_data->reportdata_len = length;

                return length;
        default:
                pr_debug("cmd %d not supported\n", areq->cmd);
                break;
        }

        return 0;
}


static int tdx_attest_open(struct inode *inode, struct file *filp)
{
        struct tdx_attest_req *areq;

        /* Allocate zeroed space so that ->cmd and ->data start out cleared */
        areq = kzalloc(sizeof(*areq), GFP_KERNEL);
        if (!areq)
                return -ENOMEM;

        filp->private_data = areq;

        return 0;
}

static int tdx_attest_release(struct inode *inode, struct file *filp)
{
        kfree(filp->private_data);
        filp->private_data = NULL;

        return 0;
}

static const struct file_operations tdx_attest_fops = {
        .owner          = THIS_MODULE,
        .open           = tdx_attest_open,
        .read           = tdx_attest_read,
        .write          = tdx_attest_write,
        .unlocked_ioctl = tdx_attest_ioctl,
        .release        = tdx_attest_release,
        .llseek         = no_llseek,
};

>>> How many of these "drivers" are we going to need which are thinly veiled
>>> ioctl()s that are only TDX module call wrappers?
> 
> This is actually a really big question.  This code is obviously just
> trying to do one thing very narrowly and not thinking about the future
> at all.  Can we please consider what the future will be like and try to
> *architect* this instead of having the kernel just play a glorified game
> of telephone?

I am not very clear about other possible use cases. 

Kirill/Kai/Isaku, Do you think we might need similar infrastructure for any
other TDX Module calls or TDVMCALLs?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-06-30 23:50         ` Sathyanarayanan Kuppuswamy
@ 2022-07-05 12:07           ` Kai Huang
  2022-07-05 18:45             ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kai Huang @ 2022-07-05 12:07 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-06-30 at 16:50 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi,
> 
> On 6/27/22 10:24 AM, Dave Hansen wrote:
> > On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
> 
> > > > Second, how can someone test this code?  It appears that they need to
> > > > assemble a veritable Rube Goldberg machine.  The least we could do is
> > > > have a selftest that just calls the ioctl() and makes sure that
> > > > something halfway sane comes out of it.
> > > 
> > > My initial submission included a test app. But I removed it later to
> > > reduce the review load. I thought to submit the test app once feature
> > > support patches are merged.
> > > 
> > > https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
> > > 
> > > If you prefer, I can add it back to the next submission with the latest changes.
> > 
> > I doubt anyone will ever run a "test app".  Why not just make this a
> > selftest?
> 
> Fine with me. I can change it into selftest.
> 
> > 
> > > > > In such
> > > > > case, since REPORTDATA is a secret, using sysfs to share it is insecure
> > > > > compared to sending it via IOCTL.
> > > > 
> > > > Huh?  How is sysfs "insecure"?
> > > 
> > > REPORTDATA (input) we pass to the Module call might come from attestation
> > > service as a per session unique ID.  If it is shared via sysfs, there is
> > > a possibility for untrusted software to read it and trigger some form of
> > > replay attack. So in this context, sysfs is insecure compared to IOCTL
> > > approach. You can find the related discussion in,
> > > 
> > > https://lore.kernel.org/lkml/b8eadd3079101a2cf93ee87d36dbedf93d8a2725.camel@intel.com/
> > 
> > I still don't get it.
> > 
> > How is a 400 sysfs file "insecure"?  This sounds "if the filesystem
> > permissions are broken" paranoia.  In other words, you're protecting
> > against a malicious root user.
> 
> AFAIK, root is not the only user of the attestation interface. General users can
> also use it (For example, in a use case like pytorch, attestation may be requested
> by server before passing the training data). So if the permission is not "root
> only", then other users or application in TD can access the sysfs file to read
> its contents. 
> 
> Also, the security concern mentioned is just an additional point. Following are
> the main reasons for choosing IOCTL over sysfs.
> 
> 1. Sysfs is generally used for config purposes. Not for request/response kind of
>    use case (Attestation falls under this category).
> 2. If we only use sysfs model for GetReport, the design might look like below: 
> 
>     /sys/kernel/../report/input
>     /sys/kernel/../report/subtype
>     /sys/kernel/../report/input_len
>     /sys/kernel/../report/output
>     /sys/kernel/../report/output_len
>     /sys/kernel/../report/status
>     /sys/kernel/../report/trigger

I don't think you need all those if using /sysfs approach.  You only need
'reportdata' and 'tdreport' to start with (see below my old reply).  

	echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
	cat /sys/kernel/coco/tdx/attest/tdreport

	Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver
	to call TDCALL to get the TDREPORT, which is available at 
	'/sys/.../tdreport'.

You can add more (such as subtype) in the future if needed (and I doubt it will
ever happen), and doing so doesn't break any existing ABI.  'output/output_len'
are also not needed; the kernel can return the report with the right size.
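
As a rough illustration of the two-file flow described above, a userspace
helper could look like the sketch below.  It assumes that REPORTDATA is
written as 64 raw bytes and that the TDREPORT reads back as 1024 raw bytes;
the helper name and both length constants are made up for this example, and
the paths are passed in rather than hard-coded.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define REPORTDATA_LEN 64    /* assumed REPORTDATA size */
#define TDREPORT_LEN   1024  /* assumed TDREPORT size */

/*
 * Write REPORTDATA to one file, then read the TDREPORT from another.
 * Returns 0 on success or a negative errno value on failure.
 */
static int sysfs_get_tdreport(const char *reportdata_path,
                              const char *tdreport_path,
                              const uint8_t *reportdata,
                              uint8_t *tdreport)
{
        ssize_t n;
        int fd, err;

        /* Step 1: the write is what would trigger the TDCALL. */
        fd = open(reportdata_path, O_WRONLY);
        if (fd < 0)
                return -errno;
        n = write(fd, reportdata, REPORTDATA_LEN);
        err = (n < 0) ? errno : 0;
        close(fd);
        if (n != REPORTDATA_LEN)
                return err ? -err : -EIO;

        /* Step 2: fetch the result; a short read is treated as an error. */
        fd = open(tdreport_path, O_RDONLY);
        if (fd < 0)
                return -errno;
        n = read(fd, tdreport, TDREPORT_LEN);
        err = (n < 0) ? errno : 0;
        close(fd);
        if (n != TDREPORT_LEN)
                return err ? -err : -EIO;

        return 0;
}
```

Note that the two steps are separate system calls, so in this sketch nothing
ties the report that is read back to the REPORTDATA that this caller wrote.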

Btw, although /sysfs may not be as secure as IOCTL -- as you mentioned
above, other programs with the same permission can get the TD report by reading
/sysfs and use it in a "replay attack" -- I am not sure whether such a
"potential replay attack" is a true issue or not.  For instance, typically the
attestation service should already have established a secure TLS connection with
the TD attestation agent before it provides the 'nonce' (reportdata), and the
attestation service should reject a TD report that arrives over any other
connection, etc.
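
A minimal sketch of that service-side check, under the assumption that the
nonce is 64 bytes and is tracked per connection, might look as follows.  The
structure and function names are invented for illustration, and a real
verifier would also have to validate the Quote signature before trusting the
REPORTDATA field at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REPORTDATA_LEN 64   /* assumed nonce/REPORTDATA size */

/* Per-connection state kept by the (hypothetical) attestation service. */
struct attest_session {
        uint8_t nonce[REPORTDATA_LEN]; /* issued over this connection only */
        bool nonce_valid;              /* cleared after the first match */
};

static void session_issue_nonce(struct attest_session *s,
                                const uint8_t *nonce)
{
        memcpy(s->nonce, nonce, REPORTDATA_LEN);
        s->nonce_valid = true;
}

/* Accept a report only if it carries this session's unused nonce. */
static bool session_accept_report(struct attest_session *s,
                                  const uint8_t *reportdata)
{
        uint8_t diff = 0;
        size_t i;

        if (!s->nonce_valid)
                return false;

        /* Compare every byte so timing does not depend on the data. */
        for (i = 0; i < REPORTDATA_LEN; i++)
                diff |= s->nonce[i] ^ reportdata[i];
        if (diff)
                return false;

        s->nonce_valid = false;   /* single use: a replay is rejected */
        return true;
}
```

With this kind of binding, a report captured by another program is of no use
on a different connection, and it cannot be replayed on the original one.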

> 
> We might need similar files for GetQuote use case as well. IMO, such a design is 
> more complicated than using IOCTL.

Using /sysfs for TD report doesn't necessarily mean you must use /sysfs for
Quote.  I don't think we should mix them up.  For instance, even if we use
/dev/xxx for getting TD report, we can use a separate device node for getting
the Quote:

	/dev/tdreport
	/dev/tdquote

I believe there are pros/cons compared to using a single /dev/attest, but I
haven't thought about this very carefully.


> > 
> > In other words, I don't think the ABI here has been well thought out.
> > 
> > Also, this is basically a:
> > 
> > 	Inputs:
> > 
> > 		* nonce
> > 		* report type
> > 
> > 	Outputs:
> > 
> > 		* report
> > 
> > I have to wonder how hard it would be to have this be:
> > 
> > 	fd = open("/dev/foo");
> > 	ioctl(fd, REPORT, type, flags??);
> > 	write(fd, &inputs, inputs_len);
> > 	read(fd,  &output, outputs_len);

It looks like the kernel and userspace still need data structures to agree on
the input/output data format.  I guess with this approach, we can start with
what we need now, and if we need more in the future, we define new data
structures for input and output?

> > 
> 
> It is not hard to implement it this way. But I don't see it being
> very different from just using IOCTL. config/{read/write} model is
> usually preferred for applications in which you have a requirement to do
> multiple read/writes after one time config (use cases like serial
> port read, printer write or reading temperature, etc). But in our case
> we will mostly use it once after every config. 
> 
> Also, splitting input over IOCTL and write system call will require us
> to add additional code to store the state.
> 
> I have attempted a sample implementation (untested) for reference. But I
> feel our current code is simpler. It handles allocation and input/output
> validation in one place compared to spreading it across multiple handlers. 
> 
> struct report_req {
>         int subtype;
>         void *reportdata;
>         int reportdata_len;
> };
> 
> struct tdx_attest_req {
>         unsigned int cmd;
>         void *data;
> };

Is it supposed to be used for Quote too?

I dislike the idea of mixing up getting TD report and getting Quote (make TD
report and Quote both as a sub-commands, etc).

As has already been pointed out, the new IOCTLs to support getting the Quote
aren't even mandatory -- we just need some communication channel between the TD
attestation agent and the QE, such as vsock.

> 
> 
> static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
>                              unsigned long arg)
> {
>         void __user *argp = (void __user *)arg;
>         struct tdx_attest_req *areq = file->private_data;
>         struct report_req *rep_data = NULL;
>         struct tdx_report_req user_req;
>         long ret = -EINVAL;
> 
>         switch (cmd) {
>         case TDX_CMD_GET_REPORT:
>                 /* Allocate zeroed space so that ->reportdata starts as NULL */
>                 rep_data = kzalloc(sizeof(*rep_data), GFP_KERNEL);
>                 if (!rep_data)
>                         return -ENOMEM;
> 
>                 /* Copy user request data */
>                 if (copy_from_user(&user_req, argp, sizeof(user_req))) {
>                         kfree(rep_data);
>                         return -EFAULT;
>                 }
> 
>                 /* Copy user request details to areq->data */
>                 rep_data->subtype = user_req.subtype;
>                 areq->cmd = cmd;
>                 areq->data = rep_data;
> 
>                 ret = 0;
>                 break;
>         default:
>                 pr_debug("cmd %d not supported\n", cmd);
>                 break;
>         }
> 
>         return ret;
> }
> 
> static ssize_t tdx_attest_read(struct file *filp, char __user *buffer,
>                                 size_t length, loff_t *offset)
> {
>         struct tdx_attest_req *areq = filp->private_data;
>         struct report_req *rep_data;
>         void *tdreport;
>         long ret;
> 
>         if (!areq)
>                 return -EINVAL;
> 
>         switch (areq->cmd) {
>         case TDX_CMD_GET_REPORT:
> 
>                 /* Check for valid length and file offset */
>                 if (length != TDX_REPORT_LEN || *offset != 0)
>                         return -EINVAL;
> 
>                 rep_data = areq->data;
> 
>                 /* Make sure we have valid reportdata */
>                 if (!rep_data->reportdata)
>                         return -EINVAL;
> 
>                 /* Allocate space for output data */
>                 tdreport = kzalloc(length, GFP_KERNEL);
>                 if (!tdreport)
>                         return -ENOMEM;
>                 /*
>                  * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>                  *
>                  * Get the TDREPORT using REPORTDATA as input. Refer to
>                  * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>                  * Specification for detailed information.
>                  */
>                 ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>                                 virt_to_phys(rep_data->reportdata), 0, 0, NULL);
>                 if (ret) {
>                         pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
>                         kfree(tdreport);
>                         return -EIO;
>                 }
> 
>                 /* Copy TDREPORT to the user buffer */
>                 if (copy_to_user(buffer, tdreport, length)) {
>                         kfree(tdreport);
>                         return -EFAULT;
>                 }
> 
>                 /* The kernel copy is no longer needed */
>                 kfree(tdreport);
> 
>                 return length;
>         default:
>                 pr_debug("cmd %d not supported\n", areq->cmd);
>                 break;
>         }
> 
>         return 0;
> }
> 
> static ssize_t tdx_attest_write(struct file *filp, const char __user *buffer,
>                                 size_t length, loff_t *offset)
> {
>         struct tdx_attest_req *areq = filp->private_data;
>         struct report_req *rep_data;
> 
>         if (!areq)
>                 return -EINVAL;
> 
>         switch (areq->cmd) {
>         case TDX_CMD_GET_REPORT:
> 
>                 /* Check for valid length and file offset */
>                 if (length != TDX_REPORTDATA_LEN || *offset != 0)
>                         return -EINVAL;
> 
>                 rep_data = areq->data;
> 
>                 /* Allocate space to store REPORTDATA */
>                 rep_data->reportdata = kzalloc(length, GFP_KERNEL);
>                 if (!rep_data->reportdata)
>                         return -ENOMEM;
> 
>                 /* Copy REPORTDATA from the user buffer */
>                 if (copy_from_user(rep_data->reportdata, buffer, length)) {
>                         kfree(rep_data->reportdata);
>                         rep_data->reportdata = NULL;
>                         return -EFAULT;
>                 }
> 
>                 rep_data->reportdata_len = length;
> 
>                 return length;
>         default:
>                 pr_debug("cmd %d not supported\n", areq->cmd);
>                 break;
>         }
> 
>         return 0;
> }
> 
> 
> static int tdx_attest_open(struct inode *inode, struct file *filp)
> {
>         struct tdx_attest_req *areq;
> 
>         /* Allocate zeroed space so that ->cmd and ->data start out cleared */
>         areq = kzalloc(sizeof(*areq), GFP_KERNEL);
>         if (!areq)
>                 return -ENOMEM;
> 
>         filp->private_data = areq;
> 
>         return 0;
> }
> 
> static int tdx_attest_release(struct inode *inode, struct file *filp)
> {
>         kfree(filp->private_data);
>         filp->private_data = NULL;
> 
>         return 0;
> }
> 
> static const struct file_operations tdx_attest_fops = {
>         .owner          = THIS_MODULE,
>         .open           = tdx_attest_open,
>         .read           = tdx_attest_read,
>         .write          = tdx_attest_write,
>         .unlocked_ioctl = tdx_attest_ioctl,
>         .release        = tdx_attest_release,
>         .llseek         = no_llseek,
> };
> 
> > > > How many of these "drivers" are we going to need which are thinly veiled
> > > > ioctl()s that are only TDX module call wrappers?
> > 
> > This is actually a really big question.  This code is obviously just
> > trying to do one thing very narrowly and not thinking about the future
> > at all.  Can we please consider what the future will be like and try to
> > *architect* this instead of having the kernel just play a glorified game
> > of telephone?
> 
> I am not very clear about other possible use cases. 
> 
> Kirill/Kai/Isaku, Do you think we might need similar infrastructure for any
> other TDX Module calls or TDVMCALLs?
> 
> 

So far only the attestation-related TDCALL and TDVMCALL require interaction with
userspace.

For attestation, conceptually, we need two "sets" of ABIs: 1) around getting the
TD report; 2) around getting the Quote.

For 1) we are discussing above. 

For 2), currently we have only GetQuote TDVMCALL.  It's very possible we will
need more sub-commands around Quote (not only get the Quote) -- logically, Quote
generation service defines Quote related commands anyway.  

Theoretically, we only need one TDVMCALL (or a fixed set of TDVMCALLs) for
sending/receiving data as a communication channel (as an alternative to vsock,
etc) to support any Quote related sub-commands, but it seems we are not heading
that way.
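
To illustrate what such a generic channel could carry, here is a hedged
sketch of a framed message with a sub-command ID and a payload length.  The
header layout, the function names and the use of host byte order are all
assumptions for this example; nothing like this is defined by the patches or
by the GetQuote TDVMCALL.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame header, stored in host byte order. */
struct quote_msg_hdr {
        uint32_t cmd;   /* sub-command ID chosen by the Quote service */
        uint32_t len;   /* payload length in bytes */
};

/* Pack header + payload into buf; returns bytes used or -ENOSPC. */
static long quote_msg_pack(uint8_t *buf, size_t buf_len, uint32_t cmd,
                           const void *payload, uint32_t len)
{
        struct quote_msg_hdr hdr = { .cmd = cmd, .len = len };

        if (buf_len < sizeof(hdr) || len > buf_len - sizeof(hdr))
                return -ENOSPC;

        memcpy(buf, &hdr, sizeof(hdr));
        if (len)
                memcpy(buf + sizeof(hdr), payload, len);
        return (long)(sizeof(hdr) + len);
}

/* Unpack a frame; returns the payload length or -EINVAL if truncated. */
static long quote_msg_unpack(const uint8_t *buf, size_t buf_len,
                             struct quote_msg_hdr *hdr,
                             const uint8_t **payload)
{
        if (buf_len < sizeof(*hdr))
                return -EINVAL;

        memcpy(hdr, buf, sizeof(*hdr));
        if (hdr->len > buf_len - sizeof(*hdr))
                return -EINVAL;

        *payload = buf + sizeof(*hdr);
        return (long)hdr->len;
}
```

With framing of this kind, new Quote-related sub-commands would only need a
new 'cmd' value agreed between the two ends, not a new TDVMCALL or IOCTL.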

-- 
Thanks,
-Kai




* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-06-28  1:15         ` Kai Huang
@ 2022-07-05 15:29           ` Kirill A. Shutemov
  2022-07-18 14:22             ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kirill A. Shutemov @ 2022-07-05 15:29 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Kirill A. Shutemov, Kuppuswamy Sathyanarayanan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, Jun 28, 2022 at 01:15:21PM +1200, Kai Huang wrote:
> 
> > 
> > I still don't like the idea of using the DMA API itself.  But, maybe we
> > need some common infrastructure that the DMA API and this code use which
> > says, "get me some pages that I can safely make shared".
> 
> Right.  For instance any KVM PV feature would require shared memory, and DMA API
> normally doesn't fit (taking 'struct kvm_steal_time' as example).
> 
> Maybe we can reserve a CMA for this purpose.

CMA for a couple of low-traffic users sounds like overkill. It would create
a separate pool just for them.

I think the best way is to add a dummy device and a couple of helpers
around the DMA API that would allow tapping into swiotlb.

Maybe hide it inside CC infrastructure. Like cc_decrypted_alloc() and
cc_decrypted_free().

-- 
 Kirill A. Shutemov


* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-05 12:07           ` Kai Huang
@ 2022-07-05 18:45             ` Sathyanarayanan Kuppuswamy
  2022-07-05 18:52               ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-05 18:45 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi,

On 7/5/22 5:07 AM, Kai Huang wrote:
> On Thu, 2022-06-30 at 16:50 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi,
>>
>> On 6/27/22 10:24 AM, Dave Hansen wrote:
>>> On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
>>
>>>>> Second, how can someone test this code?  It appears that they need to
>>>>> assemble a veritable Rube Goldberg machine.  The least we could do is
>>>>> have a selftest that just calls the ioctl() and makes sure that
>>>>> something halfway sane comes out of it.
>>>>
>>>> My initial submission included a test app. But I removed it later to
>>>> reduce the review load. I thought to submit the test app once feature
>>>> support patches are merged.
>>>>
>>>> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
>>>>
>>>> If you prefer, I can add it back to the next submission with the latest changes.
>>>
>>> I doubt anyone will ever run a "test app".  Why not just make this a
>>> selftest?
>>
>> Fine with me. I can change it into selftest.
>>
>>>
>>>>>> In such
>>>>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>>>>> compared to sending it via IOCTL.
>>>>>
>>>>> Huh?  How is sysfs "insecure"?
>>>>
>>>> REPORTDATA (input) we pass to the Module call might come from attestation
>>>> service as a per session unique ID.  If it is shared via sysfs, there is
>>>> a possibility for untrusted software to read it and trigger some form of
>>>> replay attack. So in this context, sysfs is insecure compared to IOCTL
>>>> approach. You can find the related discussion in,
>>>>
>>>> https://lore.kernel.org/lkml/b8eadd3079101a2cf93ee87d36dbedf93d8a2725.camel@intel.com/
>>>
>>> I still don't get it.
>>>
>>> How is a 400 sysfs file "insecure"?  This sounds "if the filesystem
>>> permissions are broken" paranoia.  In other words, you're protecting
>>> against a malicious root user.
>>
>> AFAIK, root is not the only user of the attestation interface. General users can
>> also use it (For example, in a use case like pytorch, attestation may be requested
>> by server before passing the training data). So if the permission is not "root
>> only", then other users or application in TD can access the sysfs file to read
>> its contents. 
>>
>> Also, the security concern mentioned is just an additional point. Following are
>> the main reasons for choosing IOCTL over sysfs.
>>
>> 1. Sysfs is generally used for config purposes. Not for request/response kind of
>>    use case (Attestation falls under this category).
>> 2. If we only use sysfs model for GetReport, the design might look like below: 
>>
>>     /sys/kernel/../report/input
>>     /sys/kernel/../report/subtype
>>     /sys/kernel/../report/input_len
>>     /sys/kernel/../report/output
>>     /sys/kernel/../report/output_len
>>     /sys/kernel/../report/status
>>     /sys/kernel/../report/trigger
> 
> I don't think you need all those if using /sysfs approach.  You only need
> 'reportdata' and 'tdreport' to start with (see below my old reply).  
> 
> 	echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
> 	cat /sys/kernel/coco/tdx/attest/tdreport
> 
> 	Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver
> 	to call TDCALL to get the TDREPORT, which is available at 
> 	'/sys/.../tdreport'.
> 
> You can add more (such as subtype) in the future if needed (and I doubt it will
> ever happen) but this doesn't break any existing ABI.  'output/output_len' also
> not needed, kernel can return the report with right size.

I included the *_len files for readability. You can do without them. But I think
you still need a status file to tell whether the TDCALL succeeded or not.
Alternatively, you could write -1 or a similar marker to the tdreport file on
failure.

Yes, there are many ways to design this. But I still think IOCTL is a better fit
for the request/response kind of use case.

I will let the maintainer decide whether we should take this route.

> 
> Btw, although the /sysfs may not be as secure as IOCTL -- as you mentioned
> above, other programs with the same permission can get the TD report by reading
> /sysfs and use it as a "replay attack" -- but I am not sure whether such
> "potential replay attack" is a true issue or not.  For instance, typically the
> attestation service should already have established a secure TLS connection with
> TD attestation agent before it provides the 'nonce' (reportdata), and the
> attestation should reject the TD report from other connection, etc.

As I have mentioned, security issue is just an additional point. If this is not
very valid or real, I can remove it.

> 
>>
>> We might need similar files for GetQuote use case as well. IMO, such a design is 
>> more complicated than using IOCTL.
> 
> Using /sysfs for TD report doesn't necessarily mean you must use /sysfs for
> Quote.  I don't think we should mixing them up.  For instance, even if we use
> /dev/xxx for getting TD report, we can use a separate device node for getting
> the Quote:
> 
> 	/dev/tdreport
> 	/dev/tdquote
> 
> I believe there should be pros/cons comparing to using single /dev/attest, but I
> haven't thought this very carefully.

Yes. There are pros and cons for different approaches. But I personally think
using one device and IOCTL ABI model is a simpler design for this use case. AMD also
uses a similar model for their attestation use case.

> 
> 
>>>
>>> In other words, I don't think the ABI here has been well thought out.
>>>
>>> Also, this is basically a:
>>>
>>> 	Inputs:
>>>
>>> 		* nonce
>>> 		* report type
>>>
>>> 	Outputs:
>>>
>>> 		* report
>>>
>>> I have to wonder how hard it would be to have this be:
>>>
>>> 	fd = open("/dev/foo");
>>> 	ioctl(fd, REPORT, type, flags??);
>>> 	write(fd, &inputs, inputs_len);
>>> 	read(fd,  &output, outputs_len);
> 
> It looks like the kernel and userspace still need data structures to agree on
> the input/output data format.  I guess with this approach, we can start with
> what we need now, and if we need more in the future, we define new data
> structures for input and output?
> 
>>>
>>
>> It is not hard to implement it this way. But I don't see it being
>> very different from just using IOCTL. config/{read/write} model is
>> usually preferred for applications in which you have a requirement to do
>> multiple read/writes after one time config (use cases like serial
>> port read, printer write or reading temperature, etc). But in our case
>> we will mostly use it once after every config. 
>>
>> Also, splitting input over IOCTL and write system call will require us
>> to add additional code to store the state.
>>
>> I have attempted a sample implementation (untested) for reference. But I
>> feel our current code is simpler. It handles allocation and input/output
>> validation in one place compared to spreading it across multiple handlers. 
>>
>> struct report_req {
>>         int subtype;
>>         void *reportdata;
>>         int reportdata_len;
>> };
>>
>> struct tdx_attest_req {
>>         unsigned int cmd;
>>         void *data;
>> };
> 
> Is it supposed to be used for Quote too?

Yes. It is to save the command request in kernel during IOCTL request.
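
To make the "additional code to store the state" point concrete, here is a
small userspace model of the split ioctl()/write()/read() flow. It is a
sketch under stated assumptions, not driver code: struct attest_ctx,
model_ioctl(), model_write() and model_read() are hypothetical stand-ins
for the per-open-file context and the three file operations, and the
"report" returned is just a copy of the input.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define REPORTDATA_LEN 64

enum attest_state { ST_IDLE, ST_CONFIGURED, ST_READY };

/* Hypothetical per-open-file context the split model has to keep. */
struct attest_ctx {
	enum attest_state state;
	unsigned int cmd;
	unsigned char reportdata[REPORTDATA_LEN];
};

/* Step 1: ioctl() selects the command, which must be remembered. */
int model_ioctl(struct attest_ctx *ctx, unsigned int cmd)
{
	assert(ctx != NULL);
	ctx->cmd = cmd;
	ctx->state = ST_CONFIGURED;
	return 0;
}

/* Step 2: write() is only valid after the command has been configured. */
int model_write(struct attest_ctx *ctx, const void *buf, size_t len)
{
	assert(ctx != NULL);
	if (ctx->state != ST_CONFIGURED || len != REPORTDATA_LEN)
		return -EINVAL;
	memcpy(ctx->reportdata, buf, len);
	ctx->state = ST_READY;
	return 0;
}

/* Step 3: read() is only valid after input was written; then reset. */
int model_read(struct attest_ctx *ctx, void *buf, size_t len)
{
	assert(ctx != NULL);
	if (ctx->state != ST_READY || len < REPORTDATA_LEN)
		return -EINVAL;
	memcpy(buf, ctx->reportdata, REPORTDATA_LEN);
	ctx->state = ST_IDLE;
	return 0;
}
```

Each handler has to validate the ordering of the calls on top of validating
its own arguments. A single IOCTL that carries both input and output needs
none of this bookkeeping.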

> 
> I dislike the idea of mixing up getting TD report and getting Quote (make TD
> report and Quote both as a sub-commands, etc).
> 
> As we have adequately put, the new IOCTLs to support getting Quote isn't even
> mandatory -- we just need some communication channel between TD attestation
> agent and the QE, such as vsock.

Yes. There are other ways to get the Quote. But we have a requirement to support
the TDVMCALL-based method.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-05 18:45             ` Sathyanarayanan Kuppuswamy
@ 2022-07-05 18:52               ` Dave Hansen
  2022-07-05 21:21                 ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-07-05 18:52 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Kai Huang, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 7/5/22 11:45, Sathyanarayanan Kuppuswamy wrote:
> Yes, there are many ways to design this. But I still think IOCTL is a better fit
> for the request/response kind of use case.

Are there any other similar ABIs in the kernel?  What kind of mechanism
do they use?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-05 18:52               ` Dave Hansen
@ 2022-07-05 21:21                 ` Sathyanarayanan Kuppuswamy
  2022-07-05 22:31                   ` Kai Huang
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-05 21:21 UTC (permalink / raw)
  To: Dave Hansen, Kai Huang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 7/5/22 11:52 AM, Dave Hansen wrote:
> On 7/5/22 11:45, Sathyanarayanan Kuppuswamy wrote:
>> Yes, there are many ways to design this. But I still think IOCTL is a better fit
>> for the request/response kind of use case.
> 
> Are there any other similar ABIs in the kernel?  What kind of mechanism
> do they use?

AMD also plans to add attestation support. It also uses the IOCTL approach.

https://patchwork.kernel.org/project/linux-mm/patch/20220307213356.2797205-44-brijesh.singh@amd.com/

SGX is a related feature. It also uses IOCTL approach for enclave provisioning.

arch/x86/kernel/cpu/sgx/ioctl.c

Other examples (not very related) are,

drivers/platform/chrome/cros_ec_chardev.c - It is an embedded controller driver which
has IOCTL support to read memory region from the device.
drivers/s390/crypto/pkey_api.c - It has IOCTL interfaces to read/write/generate crypto
keys.
drivers/crypto/ccp/sev-dev.c - sev_ioctl() has some IOCTL to set/get keys.
drivers/platform/x86/intel_scu_ipcutil.c - Uses IOCTL to read contents of registers.




-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-05 21:21                 ` Sathyanarayanan Kuppuswamy
@ 2022-07-05 22:31                   ` Kai Huang
  2022-07-06 22:27                     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kai Huang @ 2022-07-05 22:31 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
> 
> arch/x86/kernel/cpu/sgx/ioctl.c

SGX isn't a good example here.  The IOCTLs are used to create enclaves, but not
for attestation.  SGX attestation relies on enclave itself to get/verify the
report, etc, so has no interaction with the kernel.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-05 22:31                   ` Kai Huang
@ 2022-07-06 22:27                     ` Sathyanarayanan Kuppuswamy
  2022-07-06 22:59                       ` Kai Huang
  2022-07-18 22:52                       ` Sathyanarayanan Kuppuswamy
  0 siblings, 2 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-06 22:27 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 7/5/22 3:31 PM, Kai Huang wrote:
> On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
>> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
>>
>> arch/x86/kernel/cpu/sgx/ioctl.c
> 
> SGX isn't a good example here.  The IOCTLs are used to create enclaves, but not
> for attestation.  SGX attestation relies on enclave itself to get/verify the
> report, etc, so has no interaction with the kernel.

If you are looking for an attestation specific example, you can only check the AMD
code.

https://patchwork.kernel.org/project/linux-mm/patch/20220307213356.2797205-44-brijesh.singh@amd.com/

Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c also implements attestation
IOCTL specific to KVM.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-06 22:27                     ` Sathyanarayanan Kuppuswamy
@ 2022-07-06 22:59                       ` Kai Huang
  2022-07-18 22:52                       ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-07-06 22:59 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Wed, 2022-07-06 at 15:27 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 7/5/22 3:31 PM, Kai Huang wrote:
> > On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> > > SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
> > > 
> > > arch/x86/kernel/cpu/sgx/ioctl.c
> > 
> > SGX isn't a good example here.  The IOCTLs are used to create enclaves, but not
> > for attestation.  SGX attestation relies on enclave itself to get/verify the
> > report, etc, so has no interaction with the kernel.
> 
> If you are looking for an attestation specific example, you can only check the AMD
> code.
> 
> https://patchwork.kernel.org/project/linux-mm/patch/20220307213356.2797205-44-brijesh.singh@amd.com/
> 
> Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c also implements attestation
> IOCTL specific to KVM.
> 

I think we only need to look at how attestation is implemented in kernel for
other vendors, so yes it would be helpful to look at AMD's implementation.  
I'll probably also look at it when I have some time.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-06-27 11:21           ` Kai Huang
  2022-06-27 14:56             ` Sathyanarayanan Kuppuswamy
@ 2022-07-14  0:46             ` Sathyanarayanan Kuppuswamy
  2022-07-14 10:42               ` Kai Huang
  1 sibling, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-14  0:46 UTC (permalink / raw)
  To: Kai Huang, Yao, Jiewen, Nakajima, Jun
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML

Hi Kai/Dave,

On 6/27/22 4:21 AM, Kai Huang wrote:
> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>> Thank you, Jun.
>>
>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>
>> ================
>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>
>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>
>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>
> 
> Hi Sathy,
> 
> With this change, I don't think we should use system vector anymore.  Instead,
> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
> 

Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
using irq_domain_alloc_irqs() call. Vector is allocated from
"x86_vector_domain" as Kai suggested.

Since I am not well versed in this area, I would like expert comments on it.
Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
QEMU and it works fine.


diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 928dcf7a20d9..dcc878546574 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,12 +5,16 @@
 #define pr_fmt(fmt)     "tdx: " fmt
 
 #include <linux/cpufeature.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/numa.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/irqdomain.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO                   1
@@ -19,6 +23,7 @@
 
 /* TDX hypercall Leaf IDs */
 #define TDVMCALL_MAP_GPA               0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR     0x10004
 
 /* MMIO direction */
 #define EPT_READ       0
@@ -34,6 +39,26 @@
 #define VE_GET_PORT_NUM(e)     ((e) >> 16)
 #define VE_IS_IO_STRING(e)     ((e) & BIT(4))
 
+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+       tdx_event_notify_handler = handler;
+}
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+       tdx_event_notify_handler = NULL;
+}
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
                panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
 }
 
+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * return 0 on success or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+       /* Minimum vector value allowed is 32 */
+       if (vector < 32)
+               return -EINVAL;
+
+       /*
+        * Register callback vector address with VMM. More details
+        * about the ABI can be found in TDX Guest-Host-Communication
+        * Interface (GHCI), sec titled
+        * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+        */
+       if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+               return -EIO;
+
+       return 0;
+}
+
 static u64 get_cc_mask(void)
 {
        struct tdx_module_output out;
@@ -775,3 +825,52 @@ void __init tdx_early_init(void)
 
        pr_info("Guest detected\n");
 }
+
+static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
+{
+       tdx_event_notify_handler();
+       return IRQ_HANDLED;
+}
+
+static int __init tdx_late_init(void)
+{
+        struct irq_alloc_info info;
+       struct irq_cfg *cfg;
+       int evirq, cpu;
+
+       if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+               return 0;
+
+       if (!x86_vector_domain) {
+               pr_err("x86 vector domain is NULL\n");
+               return 0;
+       }
+
+       init_irq_alloc_info(&info, NULL);
+
+       evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
+
+       cpu = get_cpu();
+
+       irq_set_handler(evirq, handle_edge_irq);
+
+       /*
+        * Event notification vector will be delivered to the CPU
+        * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
+        * So set the IRQ affinity to the current CPU.
+        */
+       irq_set_affinity(evirq, cpumask_of(cpu));
+
+       if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
+               pr_err("Request event IRQ failed\n");
+
+       cfg = irq_cfg(evirq);
+
+       if (tdx_hcall_set_notify_intr(cfg->vector))
+               pr_err("Setting event notification interrupt failed\n");
+
+       put_cpu();
+
+       return 0;
+}
+late_initcall(tdx_late_init);
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
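
The argument checking in tdx_hcall_set_notify_intr() can be exercised
outside a TD guest by modelling it in userspace with the hypercall stubbed
out. This is a sketch under stated assumptions: set_notify_intr(),
hypercall_fn and the two stubs are hypothetical names, and the stub return
values only mimic success or failure rather than real GHCI status codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
#define FIRST_USABLE_VECTOR		32	/* x86 vectors 0-31 are reserved for exceptions */

/* Hypothetical hypercall backend: returns 0 on success, nonzero on failure. */
typedef uint64_t (*hypercall_fn)(uint64_t fn, uint64_t arg);

/* Same checks as the patch: reject reserved vectors, map failure to -EIO. */
long set_notify_intr(hypercall_fn hcall, uint8_t vector)
{
	assert(hcall != NULL);
	if (vector < FIRST_USABLE_VECTOR)
		return -EINVAL;
	if (hcall(TDVMCALL_SETUP_NOTIFY_INTR, vector))
		return -EIO;
	return 0;
}

/* Stub that mimics a VMM accepting the request. */
uint64_t stub_hcall_ok(uint64_t fn, uint64_t arg)
{
	(void)fn;
	(void)arg;
	return 0;
}

/* Stub that mimics a VMM rejecting the request. */
uint64_t stub_hcall_fail(uint64_t fn, uint64_t arg)
{
	(void)fn;
	(void)arg;
	return 1;
}
```

Note that tdx_late_init() as posted only logs a failure from this helper and
still returns 0, and it uses the value returned by irq_domain_alloc_irqs()
without checking it.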


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-07-14  0:46             ` Sathyanarayanan Kuppuswamy
@ 2022-07-14 10:42               ` Kai Huang
  2022-07-14 20:55                 ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kai Huang @ 2022-07-14 10:42 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Yao, Jiewen, Nakajima, Jun
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML

On Wed, 2022-07-13 at 17:46 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai/Dave,
> 
> On 6/27/22 4:21 AM, Kai Huang wrote:
> > On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
> > > Thank you, Jun.
> > > 
> > > Yes. I confirmed that we will include below change to GHCI.next spec.
> > > 
> > > ================
> > > 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
> > > 
> > > From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
> > > 
> > > To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
> > > 
> > 
> > Hi Sathy,
> > 
> > With this change, I don't think we should use system vector anymore.  Instead,
> > we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
> > 
> 
> Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
> using irq_domain_alloc_irqs() call. Vector is allocated from
> "x86_vector_domain" as Kai suggested.

I am not expert either. I said "idea only" in my first reply :)

> 
> Since I am not well versed in this area, I would like expert comments on it.
> Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
> QEMU and it works fine.
> 
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 928dcf7a20d9..dcc878546574 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -5,12 +5,16 @@
>  #define pr_fmt(fmt)     "tdx: " fmt
>  
>  #include <linux/cpufeature.h>
> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/numa.h>
>  #include <asm/coco.h>
>  #include <asm/tdx.h>
>  #include <asm/vmx.h>
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/irqdomain.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO                   1
> @@ -19,6 +23,7 @@
>  
>  /* TDX hypercall Leaf IDs */
>  #define TDVMCALL_MAP_GPA               0x10001
> +#define TDVMCALL_SETUP_NOTIFY_INTR     0x10004
>  
>  /* MMIO direction */
>  #define EPT_READ       0
> @@ -34,6 +39,26 @@
>  #define VE_GET_PORT_NUM(e)     ((e) >> 16)
>  #define VE_IS_IO_STRING(e)     ((e) & BIT(4))
>  
> +/*
> + * Handler used to report notifications about
> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> + * used only by the attestation driver. So, race condition
> + * with read/write operation is not considered.
> + */
> +static void (*tdx_event_notify_handler)(void);
> +
> +/* Helper function to register tdx_event_notify_handler */
> +void tdx_setup_ev_notify_handler(void (*handler)(void))
> +{
> +       tdx_event_notify_handler = handler;
> +}
> +
> +/* Helper function to unregister tdx_event_notify_handler */
> +void tdx_remove_ev_notify_handler(void)
> +{
> +       tdx_event_notify_handler = NULL;
> +}
> +

It looks weird that you still need it.  Couldn't we pass the function to
handle GetQuote directly to request_irq()?

>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>                 panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>  }
>  
> +/*
> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
> + *
> + * @vector: Vector address to be used for notification.
> + *
> + * return 0 on success or failure error number.
> + */
> +static long tdx_hcall_set_notify_intr(u8 vector)
> +{
> +       /* Minimum vector value allowed is 32 */
> +       if (vector < 32)
> +               return -EINVAL;
> +
> +       /*
> +        * Register callback vector address with VMM. More details
> +        * about the ABI can be found in TDX Guest-Host-Communication
> +        * Interface (GHCI), sec titled
> +        * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> +        */
> +       if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
> +               return -EIO;
> +
> +       return 0;
> +}
> +
>  static u64 get_cc_mask(void)
>  {
>         struct tdx_module_output out;
> @@ -775,3 +825,52 @@ void __init tdx_early_init(void)
>  
>         pr_info("Guest detected\n");
>  }
> +
> +static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
> +{
> +       tdx_event_notify_handler();
> +       return IRQ_HANDLED;
> +}
> +
> +static int __init tdx_late_init(void)
> +{
> +        struct irq_alloc_info info;
> +       struct irq_cfg *cfg;
> +       int evirq, cpu;
> +
> +       if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> +               return 0;
> +
> +       if (!x86_vector_domain) {
> +               pr_err("x86 vector domain is NULL\n");
> +               return 0;
> +       }
> +
> +       init_irq_alloc_info(&info, NULL);
> +
> +       evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);

If I read correctly, if you set info->mask to cpumask_of(cpu), and pass it to
irq_domain_alloc_irqs(), the x86_vector_domain.alloc will immediately allocate a
vector on the given cpu, so you don't need to call irq_set_affinity() and wait
to allocate the vector on _this_ cpu until request_irq().

> +
> +       cpu = get_cpu();
> +
> +       irq_set_handler(evirq, handle_edge_irq);
> +
> +       /*
> +        * Event notification vector will be delivered to the CPU
> +        * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
> +        * So set the IRQ affinity to the current CPU.
> +        */
> +       irq_set_affinity(evirq, cpumask_of(cpu));
> +
> +       if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
> +               pr_err("Request event IRQ failed\n");
> +
> +       cfg = irq_cfg(evirq);
> +
> +       if (tdx_hcall_set_notify_intr(cfg->vector))
> +               pr_err("Setting event notification interrupt failed\n");
> +
> +       put_cpu();

So the IRQ's affinity isn't kernel managed.  It also doesn't look like it has
anything like IRQF_NOBALANCING to prevent it from being migrated.  If I
understand correctly, userspace can still change the affinity, which could cause
the vector to be reallocated on another cpu?

Perhaps you can pass IRQF_NOBALANCING to request_irq()?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-07-14 10:42               ` Kai Huang
@ 2022-07-14 20:55                 ` Sathyanarayanan Kuppuswamy
  2022-07-14 23:58                   ` Kai Huang
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-14 20:55 UTC (permalink / raw)
  To: Kai Huang, Yao, Jiewen, Nakajima, Jun
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML



On 7/14/22 3:42 AM, Kai Huang wrote:
> On Wed, 2022-07-13 at 17:46 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi Kai/Dave,
>>
>> On 6/27/22 4:21 AM, Kai Huang wrote:
>>> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>>>> Thank you, Jun.
>>>>
>>>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>>>
>>>> ================
>>>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>>>
>>>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>>>
>>>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>>>
>>>
>>> Hi Sathy,
>>>
>>> With this change, I don't think we should use system vector anymore.  Instead,
>>> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
>>>
>>
>> Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
>> using irq_domain_alloc_irqs() call. Vector is allocated from
>> "x86_vector_domain" as Kai suggested.
> 
> I am not expert either. I said "idea only" in my first reply :)
> 
>>
>> Since I am not well versed in this area, I would like expert comments on it.
>> Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
>> QEMU and it works fine.
>>
>>
>> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
>> index 928dcf7a20d9..dcc878546574 100644
>> --- a/arch/x86/coco/tdx/tdx.c
>> +++ b/arch/x86/coco/tdx/tdx.c
>> @@ -5,12 +5,16 @@
>>  #define pr_fmt(fmt)     "tdx: " fmt
>>  
>>  #include <linux/cpufeature.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/irq.h>
>> +#include <linux/numa.h>
>>  #include <asm/coco.h>
>>  #include <asm/tdx.h>
>>  #include <asm/vmx.h>
>>  #include <asm/insn.h>
>>  #include <asm/insn-eval.h>
>>  #include <asm/pgtable.h>
>> +#include <asm/irqdomain.h>
>>  
>>  /* TDX module Call Leaf IDs */
>>  #define TDX_GET_INFO                   1
>> @@ -19,6 +23,7 @@
>>  
>>  /* TDX hypercall Leaf IDs */
>>  #define TDVMCALL_MAP_GPA               0x10001
>> +#define TDVMCALL_SETUP_NOTIFY_INTR     0x10004
>>  
>>  /* MMIO direction */
>>  #define EPT_READ       0
>> @@ -34,6 +39,26 @@
>>  #define VE_GET_PORT_NUM(e)     ((e) >> 16)
>>  #define VE_IS_IO_STRING(e)     ((e) & BIT(4))
>>  
>> +/*
>> + * Handler used to report notifications about
>> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
>> + * used only by the attestation driver. So, race condition
>> + * with read/write operation is not considered.
>> + */
>> +static void (*tdx_event_notify_handler)(void);
>> +
>> +/* Helper function to register tdx_event_notify_handler */
>> +void tdx_setup_ev_notify_handler(void (*handler)(void))
>> +{
>> +       tdx_event_notify_handler = handler;
>> +}
>> +
>> +/* Helper function to unregister tdx_event_notify_handler */
>> +void tdx_remove_ev_notify_handler(void)
>> +{
>> +       tdx_event_notify_handler = NULL;
>> +}
>> +
> 
> It looks weird that you still need it.  Couldn't we pass the function to
> handle GetQuote directly to request_irq()?


Since event notification is not specific to attestation, I did not want to
tie it directly to the GetQuote handler.

An alternative approach is to make this IRQ shared and let its users call
request_irq() and free_irq() directly.

Something like below?

--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -399,7 +399,7 @@ static const struct file_operations tdx_attest_fops = {
 
 static int __init tdx_attestation_init(void)
 {
-       int ret;
+       int ret, irq;
 
        /* Make sure we are in a valid TDX platform */
        if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
@@ -422,6 +422,14 @@ static int __init tdx_attestation_init(void)
                return ret;
        }
 
+       irq = tdx_get_evirq();
+
+       if (irq < 0)
+               return -EIO;
+
+       if (request_irq(irq, quote_callback_handler, IRQF_NOBALANCING|IRQF_SHARED,
+                               "tdx_quote", &miscdev))
+               pr_err("Register GetQuote IRQ handler failed\n");
        return 0;
 }
 device_initcall(tdx_attestation_init)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 2d09a7a4005e..e2e043fe488d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -41,6 +41,13 @@
 #define VE_GET_PORT_NUM(e)     ((e) >> 16)
 #define VE_IS_IO_STRING(e)     ((e) & BIT(4))
 
+static int tdx_evirq = -1;
+
+int tdx_get_evirq(void)
+{
+       return tdx_evirq;
+}
+
 /*
  * Handler used to report notifications about
  * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
@@ -866,16 +873,16 @@ static int __init tdx_late_init(void)
         */
        info.mask = cpumask_of(cpu);
 
-       evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
+       tdx_evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
 
-       irq_set_handler(evirq, handle_edge_irq);
+       irq_set_handler(tdx_evirq, handle_edge_irq);
 
-       cfg = irq_cfg(evirq);
+       cfg = irq_cfg(tdx_evirq);
 
        if (tdx_hcall_set_notify_intr(cfg->vector))
                pr_err("Setting event notification interrupt failed\n");

-       if (request_irq(evirq, tdx_ev_handler, IRQF_NOBALANCING, "tdx_evirq", NULL))
-               pr_err("Request event IRQ failed\n");

 
        put_cpu();
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eb4db837cc44..79b5220f5f5d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,12 +71,15 @@ void tdx_setup_ev_notify_handler(void (*handler)(void));
 
 void tdx_remove_ev_notify_handler(void);
 
+int tdx_get_evirq(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
 static inline void tdx_safe_halt(void) { };
 
 static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+static inline int tdx_get_evirq(void) { return -1; };
 
 #endif /* CONFIG_INTEL_TDX_GUEST */
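
The shared-IRQ alternative relies on each handler deciding whether the
interrupt is its own. A userspace model of that dispatch is sketched below.
It is an illustration only: struct irq_line, register_handler() and
dispatch_irq() are hypothetical stand-ins for what request_irq() with
IRQF_SHARED and the generic IRQ core do, and the pending flag stands in for
whatever per-user status a real handler would check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same values as the kernel's irqreturn_t for these two cases. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

typedef enum irqreturn (*irq_handler_fn)(int irq, void *dev_id);

#define MAX_ACTIONS 4

struct irq_action {
	irq_handler_fn handler;
	void *dev_id;
};

/* Hypothetical model of one shared interrupt line. */
struct irq_line {
	struct irq_action actions[MAX_ACTIONS];
	size_t count;
};

/* A shared line needs a non-NULL dev_id so each user can be told apart. */
int register_handler(struct irq_line *line, irq_handler_fn fn, void *dev_id)
{
	assert(line != NULL);
	if (!fn || !dev_id || line->count == MAX_ACTIONS)
		return -1;
	line->actions[line->count].handler = fn;
	line->actions[line->count].dev_id = dev_id;
	line->count++;
	return 0;
}

/* Call every registered handler; return how many claimed the interrupt. */
size_t dispatch_irq(struct irq_line *line, int irq)
{
	size_t i, handled = 0;

	assert(line != NULL);
	for (i = 0; i < line->count; i++) {
		struct irq_action *act = &line->actions[i];

		if (act->handler(irq, act->dev_id) == IRQ_HANDLED)
			handled++;
	}
	return handled;
}

/* Hypothetical user of the line, e.g. a pending Quote request. */
struct quote_dev {
	bool pending;
	int completed;
};

enum irqreturn quote_handler(int irq, void *dev_id)
{
	struct quote_dev *q = dev_id;

	(void)irq;
	if (!q->pending)
		return IRQ_NONE;	/* not ours, let other users check */
	q->pending = false;
	q->completed++;
	return IRQ_HANDLED;
}
```

In this model a user with nothing pending returns IRQ_NONE, so adding a
second consumer of the event notification later does not require changes
to the first one.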



> 
>>  /*
>>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>>   * return code.
>> @@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>>                 panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>>  }
>>  
>> +/*
>> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
>> + *
>> + * @vector: Vector address to be used for notification.
>> + *
>> + * return 0 on success or failure error number.
>> + */
>> +static long tdx_hcall_set_notify_intr(u8 vector)
>> +{
>> +       /* Minimum vector value allowed is 32 */
>> +       if (vector < 32)
>> +               return -EINVAL;
>> +
>> +       /*
>> +        * Register callback vector address with VMM. More details
>> +        * about the ABI can be found in TDX Guest-Host-Communication
>> +        * Interface (GHCI), sec titled
>> +        * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
>> +        */
>> +       if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
>> +               return -EIO;
>> +
>> +       return 0;
>> +}
>> +
>>  static u64 get_cc_mask(void)
>>  {
>>         struct tdx_module_output out;
>> @@ -775,3 +825,52 @@ void __init tdx_early_init(void)
>>  
>>         pr_info("Guest detected\n");
>>  }
>> +
>> +static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
>> +{
>> +       tdx_event_notify_handler();
>> +       return IRQ_HANDLED;
>> +}
>> +
>> +static int __init tdx_late_init(void)
>> +{
>> +        struct irq_alloc_info info;
>> +       struct irq_cfg *cfg;
>> +       int evirq, cpu;
>> +
>> +       if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>> +               return 0;
>> +
>> +       if (!x86_vector_domain) {
>> +               pr_err("x86 vector domain is NULL\n");
>> +               return 0;
>> +       }
>> +
>> +       init_irq_alloc_info(&info, NULL);
>> +
>> +       evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
> 
> If I read correctly, if you set info->mask to cpumask_of(cpu), and pass it to
> irq_domain_alloc_irqs(), the x86_vector_domain.alloc will immediately allocate a
> vector on the given cpu, so you don't need to call irq_set_affinity() and wait
> to allocate the vector on _this_ cpu until request_irq().

Agreed. I moved it to info.mask.

> 
>> +
>> +       cpu = get_cpu();
>> +
>> +       irq_set_handler(evirq, handle_edge_irq);
>> +
>> +       /*
>> +        * Event notification vector will be delivered to the CPU
>> +        * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
>> +        * So set the IRQ affinity to the current CPU.
>> +        */
>> +       irq_set_affinity(evirq, cpumask_of(cpu));
>> +
>> +       if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
>> +               pr_err("Request event IRQ failed\n");
>> +
>> +       cfg = irq_cfg(evirq);
>> +
>> +       if (tdx_hcall_set_notify_intr(cfg->vector))
>> +               pr_err("Setting event notification interrupt failed\n");
>> +
>> +       put_cpu();
> 
> So the IRQ's affinity isn't kernel managed.  Also, it doesn't seem to have
> anything like IRQF_NOBALANCING to prevent it from being migrated.  If I
> understand correctly, userspace can still change the affinity, which could
> cause the vector to be reallocated on another cpu?
> 
> Perhaps you can pass IRQF_NOBALANCING to request_irq()?

Makes sense. Tried it and it works. I will include this change.
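
To illustrate why the flag matters, here is a minimal user-space model (not
kernel code). All model_* names are hypothetical stand-ins, and the -EIO
return value is an assumption chosen for the sketch. It only shows the idea:
once the interrupt is marked as non-balanceable, a later affinity change from
userspace is refused, so the vector stays on the CPU that issued the
TDVMCALL_SETUP_NOTIFY_INTR hypercall.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for IRQF_NOBALANCING; not the kernel's value. */
#define MODEL_IRQF_NOBALANCING 0x1u

/* Hypothetical, heavily simplified view of one interrupt. */
struct model_irq {
	unsigned int flags;	/* flags supplied when the IRQ was requested */
	int cpu;		/* CPU the vector currently targets */
};

/* Models a userspace request to move the interrupt to another CPU. */
static int model_user_set_affinity(struct model_irq *irq, int new_cpu)
{
	assert(irq != NULL);

	if (irq->flags & MODEL_IRQF_NOBALANCING)
		return -EIO;	/* affinity is locked, the vector stays put */

	irq->cpu = new_cpu;
	return 0;
}
```

With the flag set, model_user_set_affinity() fails and irq->cpu is unchanged.
Without it, the target CPU silently moves away from the CPU the VMM was told
about.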

> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support
  2022-07-14 20:55                 ` Sathyanarayanan Kuppuswamy
@ 2022-07-14 23:58                   ` Kai Huang
  0 siblings, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-07-14 23:58 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Yao, Jiewen, Nakajima, Jun
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Luck, Tony, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, LKML


> > > +/*
> > > + * Handler used to report notifications about
> > > + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> > > + * used only by the attestation driver. So, race condition
> > > + * with read/write operation is not considered.
> > > + */
> > > +static void (*tdx_event_notify_handler)(void);
> > > +
> > > +/* Helper function to register tdx_event_notify_handler */
> > > +void tdx_setup_ev_notify_handler(void (*handler)(void))
> > > +{
> > > +       tdx_event_notify_handler = handler;
> > > +}
> > > +
> > > +/* Helper function to unregister tdx_event_notify_handler */
> > > +void tdx_remove_ev_notify_handler(void)
> > > +{
> > > +       tdx_event_notify_handler = NULL;
> > > +}
> > > +
> > 
> > It looks odd that you still need this.  Couldn't we pass the function that
> > handles GetQuote directly to request_irq()?
> 
> 
> Since event notification is not only specific to attestation, I did not want to
> directly tie it to GetQuote handler.
> 
> An alternative approach is to make this IRQ vector shared and let its users
> call request_irq() and free_irq() directly.

I think IRQF_SHARED is the logical choice.  We don't need it yet, though, since
there is only one user for now.

The "multiple users" case may never happen, so it doesn't justify the
additional 'tdx_event_notify_handler' code at this moment.
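
For reference, the shared-IRQ model discussed here can be sketched in user
space as follows. This is not kernel code: every model_* name is hypothetical,
and the fixed-size table is an assumption made to keep the sketch short. It
shows the two properties that matter: each user registers its own handler with
a unique, non-NULL dev_id, and every registered handler gets a chance to run
when the line fires.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ACTIONS 4

/* A handler returns 1 if it handled the event, 0 otherwise. */
typedef int (*model_handler_t)(void *dev_id);

struct model_action {
	model_handler_t handler;
	void *dev_id;
};

struct model_irq_line {
	struct model_action actions[MODEL_MAX_ACTIONS];
	size_t count;
};

/* Loosely mirrors request_irq(irq, handler, IRQF_SHARED, name, dev_id). */
static int model_request_shared(struct model_irq_line *line,
				model_handler_t handler, void *dev_id)
{
	assert(line != NULL && handler != NULL);

	/* Shared users are told apart by dev_id, so it must not be NULL. */
	if (!dev_id || line->count == MODEL_MAX_ACTIONS)
		return -1;

	line->actions[line->count].handler = handler;
	line->actions[line->count].dev_id = dev_id;
	line->count++;
	return 0;
}

/* Loosely mirrors free_irq(irq, dev_id): drop the matching action. */
static int model_free_shared(struct model_irq_line *line, void *dev_id)
{
	for (size_t i = 0; i < line->count; i++) {
		if (line->actions[i].dev_id == dev_id) {
			line->actions[i] = line->actions[line->count - 1];
			line->count--;
			return 0;
		}
	}
	return -1;
}

/* On an interrupt, call every registered handler and count the hits. */
static int model_dispatch(struct model_irq_line *line)
{
	int handled = 0;

	for (size_t i = 0; i < line->count; i++)
		handled += line->actions[i].handler(line->actions[i].dev_id);
	return handled;
}

/* Example handler: bumps the event counter that dev_id points to. */
static int model_count_event(void *dev_id)
{
	(*(int *)dev_id)++;
	return 1;
}
```

With a single user, the table holds exactly one entry, which is the same as
passing that user's handler straight to request_irq().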

> 
> Something like below?
> 
> --- a/arch/x86/coco/tdx/attest.c
> +++ b/arch/x86/coco/tdx/attest.c
> @@ -399,7 +399,7 @@ static const struct file_operations tdx_attest_fops = {
>  
>  static int __init tdx_attestation_init(void)
>  {
> -       int ret;
> +       int ret, irq;
>  
>         /* Make sure we are in a valid TDX platform */
>         if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> @@ -422,6 +422,14 @@ static int __init tdx_attestation_init(void)
>                 return ret;
>         }
>  
> +       irq = tdx_get_evirq();

The function seems unnecessary.  You can put 'extern int tdx_evtirq;' in
asm/tdx.h if needed and use it directly.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-05 15:29           ` Kirill A. Shutemov
@ 2022-07-18 14:22             ` Sathyanarayanan Kuppuswamy
  2022-07-19 16:13               ` Kirill A. Shutemov
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-18 14:22 UTC (permalink / raw)
  To: Kirill A. Shutemov, Kai Huang
  Cc: Dave Hansen, Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/5/22 8:29 AM, Kirill A. Shutemov wrote:
>>> I still don't like the idea of using the DMA API itself.  But, maybe we
>>> need some common infrastructure that the DMA API and this code use which
>>> says, "get me some pages that I can safely make shared".
>> Right.  For instance any KVM PV feature would require shared memory, and DMA API
>> normally doesn't fit (taking 'struct kvm_steal_time' as example).
>>
>> Maybe we can reserve a CMA for this purpose.
> CMA for a couple of low-traffic users sounds like overkill. It will create
> a separate pool just for them.
> 
> I think the best way is to add a dummy device and a couple of helpers
> around the DMA API that would allow tapping into swiotlb.
> 
> Maybe hide it inside CC infrastructure. Like cc_decrypted_alloc() and
> cc_decrypted_free().

I also think creating a generic device in the CC layer and using it to allocate
memory via the DMA API is a better approach for this issue. Below is a sample
implementation to give you an idea of how it looks. Please let me know
your comments.

CC users can call the cc_shmem_alloc()/cc_shmem_free() APIs to allocate and
free shared memory.

Other vendors can define their own shared memory allocation and free
logic via the struct shmem_priv alloc/free/init hooks.

--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -3,6 +3,6 @@ CFLAGS_REMOVE_core.o    = -pg
 KASAN_SANITIZE_core.o  := n
 CFLAGS_core.o          += -fno-stack-protector
 
-obj-y += core.o
+obj-y += core.o shmem.o
 
 obj-$(CONFIG_INTEL_TDX_GUEST)  += tdx/
diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 49b44f881484..62fe68d1f60a 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -13,7 +13,7 @@
 #include <asm/coco.h>
 #include <asm/processor.h>
 
-static enum cc_vendor vendor __ro_after_init;
+enum cc_vendor vendor __ro_after_init;
 static u64 cc_mask __ro_after_init;
 
 static bool intel_cc_platform_has(enum cc_attr attr)
@@ -23,6 +23,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
        case CC_ATTR_HOTPLUG_DISABLED:
        case CC_ATTR_GUEST_MEM_ENCRYPT:
        case CC_ATTR_MEM_ENCRYPT:
+       case CC_ATTR_SHMEM:
                return true;
        default:
                return false;
@@ -134,6 +135,11 @@ __init void cc_set_vendor(enum cc_vendor v)
        vendor = v;
 }
 
+enum cc_vendor cc_get_vendor(void)
+{
+       return vendor;
+}
+
 __init void cc_set_mask(u64 mask)
 {
        cc_mask = mask;

+++ b/arch/x86/coco/shmem.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Confidential Computing Shared Memory Allocator
+ *
+ * Copyright (C) 2022 Intel Corporation, Inc.
+ *
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "CC: " fmt
+
+#include <linux/export.h>
+#include <linux/cma.h>
+#include <linux/mm.h>
+#include <linux/cc_platform.h>
+#include <linux/set_memory.h>
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+
+#include <asm/coco.h>
+#include <asm/processor.h>
+
+#define CC_SHMEM_DRIVER                "cc-shmem"
+
+struct platform_device *shmem_pdev;
+
+struct shmem_priv
+{
+       int (*init)(struct platform_device *dev);
+       void* (*alloc)(size_t size, gfp_t gfp, struct shmem_priv *priv);
+       void (*free)(void *addr, size_t size, struct shmem_priv *priv);
+       struct platform_device *pdev;
+       void *data;
+};
+
+void *cc_shmem_alloc(size_t size, gfp_t gfp)
+{
+       struct shmem_priv *priv;
+
+       if (!shmem_pdev)
+               return NULL;
+
+       priv = platform_get_drvdata(shmem_pdev);
+
+       return priv->alloc(size, gfp, priv);
+}
+
+void cc_shmem_free(void *addr, size_t size)
+{
+       struct shmem_priv *priv;
+
+       if (!shmem_pdev)
+               return;
+
+       priv = platform_get_drvdata(shmem_pdev);
+
+       priv->free(addr, size, priv);
+}
+
+static int intel_shmem_init(struct platform_device *pdev)
+{
+       struct shmem_priv *priv;
+       dma_addr_t *handle;
+
+       handle = devm_kmalloc(&pdev->dev, sizeof(*handle), GFP_KERNEL);
+       if (!handle)
+               return -ENOMEM;
+
+       priv = platform_get_drvdata(pdev);
+
+       priv->data = handle;
+
+       return dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64));
+}
+
+static void *intel_shmem_alloc(size_t size, gfp_t gfp, struct shmem_priv *priv)
+{
+       dma_addr_t *handle = (dma_addr_t *) priv->data;
+
+       return dma_alloc_coherent(&priv->pdev->dev, size, handle, gfp);
+}
+
+static void intel_shmem_free(void *addr, size_t size, struct shmem_priv *priv)
+{
+       dma_addr_t *handle = (dma_addr_t *) priv->data;
+
+       dma_free_coherent(&priv->pdev->dev, size, addr, *handle);
+}
+
+static struct shmem_priv intel_shmem = {
+       .init  = intel_shmem_init,
+       .alloc = intel_shmem_alloc,
+       .free  = intel_shmem_free,
+};
+
+static int cc_set_drv_data(struct platform_device *pdev)
+{
+       enum cc_vendor vendor = cc_get_vendor();
+
+       switch (vendor) {
+       case CC_VENDOR_INTEL:
+               platform_set_drvdata(pdev, &intel_shmem);
+               return 0;
+       default:
+               break;
+       }
+
+       return -ENODEV;
+}
+
+static int cc_shmem_probe(struct platform_device *pdev)
+{
+       struct shmem_priv *priv;
+       void *data;
+
+       if (cc_set_drv_data(pdev))
+               return -ENODEV;
+
+       priv = platform_get_drvdata(pdev);
+
+       shmem_pdev = priv->pdev = pdev;
+
+       if (priv->init(pdev))
+               return -EIO;
+
+       return 0;
+}
+
+static struct platform_driver cc_shmem_driver = {
+       .probe  = cc_shmem_probe,
+       .driver = {
+               .name = CC_SHMEM_DRIVER,
+       },
+};
+
+static int __init cc_shmem_init(void)
+{
+       struct platform_device *pdev;
+       int ret;
+
+       if (!cc_platform_has(CC_ATTR_SHMEM))
+               return -ENODEV;
+
+       ret =  platform_driver_register(&cc_shmem_driver);
+       if (ret)
+               return ret;
+
+       pdev = platform_device_register_simple(CC_SHMEM_DRIVER, -1, NULL, 0);
+       if (IS_ERR(pdev)) {
+               platform_driver_unregister(&cc_shmem_driver);
+               return PTR_ERR(pdev);
+       }
+
+       return 0;
+}
+device_initcall(cc_shmem_init);
diff --git a/arch/x86/include/asm/coco.h b/arch/x86/include/asm/coco.h
index 3d98c3a60d34..5c259135c8fc 100644
--- a/arch/x86/include/asm/coco.h
+++ b/arch/x86/include/asm/coco.h
@@ -12,11 +12,14 @@ enum cc_vendor {
 };
 
 void cc_set_vendor(enum cc_vendor v);
+enum cc_vendor cc_get_vendor(void);
 void cc_set_mask(u64 mask);
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
 u64 cc_mkenc(u64 val);
 u64 cc_mkdec(u64 val);
+void *cc_shmem_alloc(size_t size, gfp_t gfp);
+void cc_shmem_free(void *addr, size_t size);
 #else
 static inline u64 cc_mkenc(u64 val)
 {
@@ -27,6 +30,10 @@ static inline u64 cc_mkdec(u64 val)
 {
        return val;
 }
+
+static inline void *cc_shmem_alloc(size_t size, gfp_t gfp) { return NULL; }
+static inline void cc_shmem_free(void *addr, size_t size) { }
+
 #endif
 
 #endif /* _ASM_X86_COCO_H */
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index cb0d6cd1c12f..50e0db22056f 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -90,6 +90,17 @@ enum cc_attr {
         * Examples include TDX Guest.
         */
        CC_ATTR_HOTPLUG_DISABLED,
+
+       /**
+        * @CC_ATTR_SHMEM: CC shared memory is supported.
+        *
+        * The platform/OS is running as a guest/virtual machine
+        * can allocate shared memory using cc_shmem_{alloc/free}()
+        * APIs.
+        *
+        */
+       CC_ATTR_SHMEM,
+
 };
 
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
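
To make the hook dispatch easier to follow in isolation, here is a minimal
user-space sketch of the same pattern. It is not the patch itself: all model_*
names are hypothetical, calloc()/free() stand in for the DMA API, and the
platform device is reduced to a single ops pointer. The point is only that the
public alloc/free helpers stay vendor-neutral and forward to whatever hooks the
probe step selected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical vendor IDs; the real enum cc_vendor is not reproduced here. */
enum model_vendor { MODEL_VENDOR_NONE, MODEL_VENDOR_INTEL };

/* Per-vendor hooks, analogous in spirit to struct shmem_priv. */
struct model_shmem_ops {
	void *(*alloc)(size_t size);
	void (*free)(void *addr, size_t size);
};

/* Stand-in backend: zeroed heap memory instead of a DMA allocation. */
static void *model_intel_alloc(size_t size)
{
	assert(size > 0);
	return calloc(1, size);
}

static void model_intel_free(void *addr, size_t size)
{
	(void)size;	/* unused by this stand-in backend */
	free(addr);
}

static const struct model_shmem_ops model_intel_ops = {
	.alloc = model_intel_alloc,
	.free  = model_intel_free,
};

/* NULL until the "probe" step has picked a vendor backend. */
static const struct model_shmem_ops *model_ops;

/* Analogous to the probe path: select hooks based on the vendor. */
static int model_probe(enum model_vendor vendor)
{
	switch (vendor) {
	case MODEL_VENDOR_INTEL:
		model_ops = &model_intel_ops;
		return 0;
	default:
		return -1;	/* no backend for this vendor */
	}
}

/* Vendor-neutral entry points: fail softly if nothing was probed. */
static void *model_shmem_alloc(size_t size)
{
	return model_ops ? model_ops->alloc(size) : NULL;
}

static void model_shmem_free(void *addr, size_t size)
{
	if (model_ops)
		model_ops->free(addr, size);
}
```

Before model_probe() succeeds, model_shmem_alloc() returns NULL, which matches
the "!shmem_pdev" early return in the helpers above.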


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver
  2022-07-06 22:27                     ` Sathyanarayanan Kuppuswamy
  2022-07-06 22:59                       ` Kai Huang
@ 2022-07-18 22:52                       ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-18 22:52 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/6/22 3:27 PM, Sathyanarayanan Kuppuswamy wrote:
> 
> 
> On 7/5/22 3:31 PM, Kai Huang wrote:
>> On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
>>> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
>>>
>>> arch/x86/kernel/cpu/sgx/ioctl.c
>>
>> SGX isn't a good example here.  The IOCTLs are used to create enclaves, but not
>> for attestation.  SGX attestation relies on enclave itself to get/verify the
>> report, etc, so has no interaction with the kernel.
> 
> If you are looking for an attestation specific example, you can only check the AMD
> code.
> 
> https://patchwork.kernel.org/project/linux-mm/patch/20220307213356.2797205-44-brijesh.singh@amd.com/
> 
> Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c implements an
> attestation IOCTL specific to KVM.
> 

As mentioned above, AMD also uses the IOCTL model for the attestation use case.
So do you agree with us using a similar approach here? Please let me know your
comments.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-18 14:22             ` Sathyanarayanan Kuppuswamy
@ 2022-07-19 16:13               ` Kirill A. Shutemov
  2022-07-19 17:10                 ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kirill A. Shutemov @ 2022-07-19 16:13 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On Mon, Jul 18, 2022 at 07:22:52AM -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Dave,
> 
> On 7/5/22 8:29 AM, Kirill A. Shutemov wrote:
> >>> I still don't like the idea of using the DMA API itself.  But, maybe we
> >>> need some common infrastructure that the DMA API and this code use which
> >>> says, "get me some pages that I can safely make shared".
> >> Right.  For instance any KVM PV feature would require shared memory, and DMA API
> >> normally doesn't fit (taking 'struct kvm_steal_time' as example).
> >>
> >> Maybe we can reserve a CMA for this purpose.
> > CMA for a couple of low-traffic users sounds like overkill. It will create
> > a separate pool just for them.
> > 
> > I think the best way is to add a dummy device and a couple of helpers
> > around the DMA API that would allow tapping into swiotlb.
> > 
> > Maybe hide it inside CC infrastructure. Like cc_decrypted_alloc() and
> > cc_decrypted_free().
> 
> I also think creating a generic device in the CC layer, and using it to allocate
> memory via DMA APIs is a better approach for this issue. Following is the sample
> implementation to give you an idea on how it looks. Please let me know
> your comments.
> 
> cc_shmem_alloc/free() APIs can be used by CC users to allocate and
> free shared memory.

We usually use the term 'decrypted' in the kernel: cc_decrypted_alloc()/_free().

> Other vendors can define their own shared memory allocation and free
> logic via struct shmem_priv alloc/free/init hooks.
> 
> --- a/arch/x86/coco/Makefile
> +++ b/arch/x86/coco/Makefile
> @@ -3,6 +3,6 @@ CFLAGS_REMOVE_core.o    = -pg
>  KASAN_SANITIZE_core.o  := n
>  CFLAGS_core.o          += -fno-stack-protector
>  
> -obj-y += core.o
> +obj-y += core.o shmem.o

Rename shmem.o -> mem.o ?

> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
> index 49b44f881484..62fe68d1f60a 100644
> --- a/arch/x86/coco/core.c
> +++ b/arch/x86/coco/core.c
> @@ -13,7 +13,7 @@
>  #include <asm/coco.h>
>  #include <asm/processor.h>
>  
> -static enum cc_vendor vendor __ro_after_init;
> +enum cc_vendor vendor __ro_after_init;
>  static u64 cc_mask __ro_after_init;
>  
>  static bool intel_cc_platform_has(enum cc_attr attr)
> @@ -23,6 +23,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
>         case CC_ATTR_HOTPLUG_DISABLED:
>         case CC_ATTR_GUEST_MEM_ENCRYPT:
>         case CC_ATTR_MEM_ENCRYPT:
> +       case CC_ATTR_SHMEM:
>                 return true;
>         default:
>                 return false;

I don't think we need a new attribute. CC_ATTR_MEM_ENCRYPT has to be
enough.

> @@ -134,6 +135,11 @@ __init void cc_set_vendor(enum cc_vendor v)
>         vendor = v;
>  }
>  
> +enum cc_vendor cc_get_vendor(void)
> +{
> +       return vendor;
> +}
> +
>  __init void cc_set_mask(u64 mask)
>  {
>         cc_mask = mask;
> 
> +++ b/arch/x86/coco/shmem.c
> @@ -0,0 +1,156 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Confidential Computing Shared Memory Allocator
> + *
> + * Copyright (C) 2022 Intel Corporation, Inc.
> + *
> + */
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt)     "CC: " fmt
> +
> +#include <linux/export.h>
> +#include <linux/cma.h>
> +#include <linux/mm.h>
> +#include <linux/cc_platform.h>
> +#include <linux/set_memory.h>
> +#include <linux/platform_device.h>
> +#include <linux/dma-mapping.h>
> +
> +#include <asm/coco.h>
> +#include <asm/processor.h>
> +
> +#define CC_SHMEM_DRIVER                "cc-shmem"
> +
> +struct platform_device *shmem_pdev;
> +
> +struct shmem_priv
> +{
> +       int (*init)(struct platform_device *dev);
> +       void* (*alloc)(size_t size, gfp_t gfp, struct shmem_priv *priv);
> +       void (*free)(void *addr, size_t size, struct shmem_priv *priv);
> +       struct platform_device *pdev;
> +       void *data;
> +};
> +
> +void *cc_shmem_alloc(size_t size, gfp_t gfp)
> +{
> +       struct shmem_priv *priv;
> +
> +       if (!shmem_pdev)
> +               return NULL;
> +
> +       priv = platform_get_drvdata(shmem_pdev);
> +
> +       return priv->alloc(size, gfp, priv);
> +}
> +
> +void cc_shmem_free(void *addr, size_t size)
> +{
> +       struct shmem_priv *priv;
> +
> +       if (!shmem_pdev)
> +               return;
> +
> +       priv = platform_get_drvdata(shmem_pdev);
> +
> +       priv->free(addr, size, priv);
> +}
> +
> +static int intel_shmem_init(struct platform_device *pdev)
> +{
> +       struct shmem_priv *priv;
> +       dma_addr_t *handle;
> +
> +       handle = devm_kmalloc(&pdev->dev, sizeof(*handle), GFP_KERNEL);
> +       if (!handle)
> +               return -ENOMEM;
> +
> +       priv = platform_get_drvdata(pdev);
> +
> +       priv->data = handle;
> +
> +       return dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64));
> +}
> +
> +static void *intel_shmem_alloc(size_t size, gfp_t gfp, struct shmem_priv *priv)
> +{
> +       dma_addr_t *handle = (dma_addr_t *) priv->data;
> +
> +       return dma_alloc_coherent(&priv->pdev->dev, size, handle, gfp);
> +}
> +
> +static void intel_shmem_free(void *addr, size_t size, struct shmem_priv *priv)
> +{
> +       dma_addr_t *handle = (dma_addr_t *) priv->data;
> +
> +       dma_free_coherent(&priv->pdev->dev, size, addr, *handle);
> +}
> +
> +static struct shmem_priv intel_shmem = {
> +       .init  = intel_shmem_init,
> +       .alloc = intel_shmem_alloc,
> +       .free  = intel_shmem_free,
> +};

Hm. What is Intel-specific here. Looks like a generic thing, no?

Maybe just drop all vendor stuff. CC_ATTR_MEM_ENCRYPT should be enough.


-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-19 16:13               ` Kirill A. Shutemov
@ 2022-07-19 17:10                 ` Sathyanarayanan Kuppuswamy
  2022-07-19 21:55                   ` Kirill A. Shutemov
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-19 17:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel



On 7/19/22 9:13 AM, Kirill A. Shutemov wrote:
> On Mon, Jul 18, 2022 at 07:22:52AM -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi Dave,
>>
>> On 7/5/22 8:29 AM, Kirill A. Shutemov wrote:
>>>>> I still don't like the idea of using the DMA API itself.  But, maybe we
>>>>> need some common infrastructure that the DMA API and this code use which
>>>>> says, "get me some pages that I can safely make shared".
>>>> Right.  For instance any KVM PV feature would require shared memory, and DMA API
>>>> normally doesn't fit (taking 'struct kvm_steal_time' as example).
>>>>
>>>> Maybe we can reserve a CMA for this purpose.
>>> CMA for a couple of low-traffic users sounds like overkill. It will create
>>> a separate pool just for them.
>>>
>>> I think the best way is to add a dummy device and a couple of helpers
>>> around the DMA API that would allow tapping into swiotlb.
>>>
>>> Maybe hide it inside CC infrastructure. Like cc_decrypted_alloc() and
>>> cc_decrypted_free().
>>
>> I also think creating a generic device in the CC layer, and using it to allocate
>> memory via DMA APIs is a better approach for this issue. Following is the sample
>> implementation to give you an idea on how it looks. Please let me know
>> your comments.
>>
>> cc_shmem_alloc/free() APIs can be used by CC users to allocate and
>> free shared memory.
> 
> We usually use the term 'decrypted' in the kernel: cc_decrypted_alloc()/_free().

Fine with me. I will use it.

> 
>> Other vendors can define their own shared memory allocation and free
>> logic via struct shmem_priv alloc/free/init hooks.
>>
>> --- a/arch/x86/coco/Makefile
>> +++ b/arch/x86/coco/Makefile
>> @@ -3,6 +3,6 @@ CFLAGS_REMOVE_core.o    = -pg
>>  KASAN_SANITIZE_core.o  := n
>>  CFLAGS_core.o          += -fno-stack-protector
>>  
>> -obj-y += core.o
>> +obj-y += core.o shmem.o
> 
> Rename shmem.o -> mem.o ?

Ok.

> 
>> diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
>> index 49b44f881484..62fe68d1f60a 100644
>> --- a/arch/x86/coco/core.c
>> +++ b/arch/x86/coco/core.c
>> @@ -13,7 +13,7 @@
>>  #include <asm/coco.h>
>>  #include <asm/processor.h>
>>  
>> -static enum cc_vendor vendor __ro_after_init;
>> +enum cc_vendor vendor __ro_after_init;
>>  static u64 cc_mask __ro_after_init;
>>  
>>  static bool intel_cc_platform_has(enum cc_attr attr)
>> @@ -23,6 +23,7 @@ static bool intel_cc_platform_has(enum cc_attr attr)
>>         case CC_ATTR_HOTPLUG_DISABLED:
>>         case CC_ATTR_GUEST_MEM_ENCRYPT:
>>         case CC_ATTR_MEM_ENCRYPT:
>> +       case CC_ATTR_SHMEM:
>>                 return true;
>>         default:
>>                 return false;
> 
> I don't think we need a new attribute. CC_ATTR_MEM_ENCRYPT has to be
> enough.

Ok. Makes sense.

> 
>> @@ -134,6 +135,11 @@ __init void cc_set_vendor(enum cc_vendor v)
>>         vendor = v;
>>  }
>>  
>> +enum cc_vendor cc_get_vendor(void)
>> +{
>> +       return vendor;
>> +}
>> +
>>  __init void cc_set_mask(u64 mask)
>>  {
>>         cc_mask = mask;
>>
>> +++ b/arch/x86/coco/shmem.c
>> @@ -0,0 +1,156 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Confidential Computing Shared Memory Allocator
>> + *
>> + * Copyright (C) 2022 Intel Corporation, Inc.
>> + *
>> + */
>> +
>> +#undef pr_fmt
>> +#define pr_fmt(fmt)     "CC: " fmt
>> +
>> +#include <linux/export.h>
>> +#include <linux/cma.h>
>> +#include <linux/mm.h>
>> +#include <linux/cc_platform.h>
>> +#include <linux/set_memory.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/dma-mapping.h>
>> +
>> +#include <asm/coco.h>
>> +#include <asm/processor.h>
>> +
>> +#define CC_SHMEM_DRIVER                "cc-shmem"
>> +
>> +struct platform_device *shmem_pdev;
>> +
>> +struct shmem_priv
>> +{
>> +       int (*init)(struct platform_device *dev);
>> +       void* (*alloc)(size_t size, gfp_t gfp, struct shmem_priv *priv);
>> +       void (*free)(void *addr, size_t size, struct shmem_priv *priv);
>> +       struct platform_device *pdev;
>> +       void *data;
>> +};
>> +
>> +void *cc_shmem_alloc(size_t size, gfp_t gfp)
>> +{
>> +       struct shmem_priv *priv;
>> +
>> +       if (!shmem_pdev)
>> +               return NULL;
>> +
>> +       priv = platform_get_drvdata(shmem_pdev);
>> +
>> +       return priv->alloc(size, gfp, priv);
>> +}
>> +
>> +void cc_shmem_free(void *addr, size_t size)
>> +{
>> +       struct shmem_priv *priv;
>> +
>> +       if (!shmem_pdev)
>> +               return;
>> +
>> +       priv = platform_get_drvdata(shmem_pdev);
>> +
>> +       priv->free(addr, size, priv);
>> +}
>> +
>> +static int intel_shmem_init(struct platform_device *pdev)
>> +{
>> +       struct shmem_priv *priv;
>> +       dma_addr_t *handle;
>> +
>> +       handle = devm_kmalloc(&pdev->dev, sizeof(*handle), GFP_KERNEL);
>> +       if (!handle)
>> +               return -ENOMEM;
>> +
>> +       priv = platform_get_drvdata(pdev);
>> +
>> +       priv->data = handle;
>> +
>> +       return dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64));
>> +}
>> +
>> +static void *intel_shmem_alloc(size_t size, gfp_t gfp, struct shmem_priv *priv)
>> +{
>> +       dma_addr_t *handle = (dma_addr_t *) priv->data;
>> +
>> +       return dma_alloc_coherent(&priv->pdev->dev, size, handle, gfp);
>> +}
>> +
>> +static void intel_shmem_free(void *addr, size_t size, struct shmem_priv *priv)
>> +{
>> +       dma_addr_t *handle = (dma_addr_t *) priv->data;
>> +
>> +       dma_free_coherent(&priv->pdev->dev, size, addr, *handle);
>> +}
>> +
>> +static struct shmem_priv intel_shmem = {
>> +       .init  = intel_shmem_init,
>> +       .alloc = intel_shmem_alloc,
>> +       .free  = intel_shmem_free,
>> +};
> 
> Hm. What is Intel-specific here. Looks like a generic thing, no?
> 
> Maybe just drop all vendor stuff. CC_ATTR_MEM_ENCRYPT should be enough.

I thought that not all CC vendors would want to use DMA APIs for shared
buffer allocation. So adding a vendor layer would give them a way to implement
their own model.

> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-19 17:10                 ` Sathyanarayanan Kuppuswamy
@ 2022-07-19 21:55                   ` Kirill A. Shutemov
  2022-07-20 14:56                     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kirill A. Shutemov @ 2022-07-19 21:55 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On Tue, Jul 19, 2022 at 10:10:20AM -0700, Sathyanarayanan Kuppuswamy wrote:
> >> +static struct shmem_priv intel_shmem = {
> >> +       .init  = intel_shmem_init,
> >> +       .alloc = intel_shmem_alloc,
> >> +       .free  = intel_shmem_free,
> >> +};
> > 
> > Hm. What is Intel-specific here. Looks like a generic thing, no?
> > 
> > Maybe just drop all vendor stuff. CC_ATTR_MEM_ENCRYPT should be enough.
> 
> I thought that not all CC vendors would want to use DMA APIs for shared
> buffer allocation. So adding a vendor layer would give them a way to implement
> their own model.

set_memory_decrypted() is gated by CC_ATTR_MEM_ENCRYPT and it is the only
requirement for functionality AFAICS.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-19 21:55                   ` Kirill A. Shutemov
@ 2022-07-20 14:56                     ` Sathyanarayanan Kuppuswamy
  2022-07-20 16:17                       ` Kirill A. Shutemov
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-20 14:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

Hi Kirill,

On 7/19/22 2:55 PM, Kirill A. Shutemov wrote:
> On Tue, Jul 19, 2022 at 10:10:20AM -0700, Sathyanarayanan Kuppuswamy wrote:
>>>> +static struct shmem_priv intel_shmem = {
>>>> +       .init  = intel_shmem_init,
>>>> +       .alloc = intel_shmem_alloc,
>>>> +       .free  = intel_shmem_free,
>>>> +};
>>>
>>> Hm. What is Intel-specific here. Looks like a generic thing, no?
>>>
>>> Maybe just drop all vendor stuff. CC_ATTR_MEM_ENCRYPT should be enough.
>>
>> I thought that not all CC vendors would want to use DMA APIs for shared
>> buffer allocation. So adding a vendor layer would give them a way to implement
>> their own model.
> 
> set_memory_decrypted() is gated by CC_ATTR_MEM_ENCRYPT and it is the only
> requirement for functionality AFAICS.

Makes sense. I think CC_ATTR_GUEST_MEM_ENCRYPT is the better check here, since
it force-enables SWIOTLB usage.

diff --git a/arch/x86/coco/Makefile b/arch/x86/coco/Makefile
index c816acf78b6a..96fc4ec4497f 100644
--- a/arch/x86/coco/Makefile
+++ b/arch/x86/coco/Makefile
@@ -3,6 +3,6 @@ CFLAGS_REMOVE_core.o    = -pg
 KASAN_SANITIZE_core.o  := n
 CFLAGS_core.o          += -fno-stack-protector
 
-obj-y += core.o
+obj-y += core.o mem.o
 
 obj-$(CONFIG_INTEL_TDX_GUEST)  += tdx/
diff --git a/arch/x86/coco/mem.c b/arch/x86/coco/mem.c
new file mode 100644
index 000000000000..ef76a8accc1e
--- /dev/null
+++ b/arch/x86/coco/mem.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Confidential Computing Decrypted Memory Allocator
+ *
+ * Copyright (C) 2022 Intel Corporation, Inc.
+ *
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "cc/mem: " fmt
+
+#include <linux/export.h>
+#include <linux/mm.h>
+#include <linux/cc_platform.h>
+#include <linux/set_memory.h>
+#include <linux/platform_device.h>
+#include <linux/dma-mapping.h>
+
+#include <asm/coco.h>
+#include <asm/processor.h>
+
+#define CC_MEM_DRIVER          "ccmem"
+
+struct platform_device *mem_pdev;
+dma_addr_t handle;
+
+/* Allocate decrypted memory of given size */
+void *cc_decrypted_alloc(size_t size, gfp_t gfp)
+{
+       if (!mem_pdev)
+               return NULL;
+
+       return dma_alloc_coherent(&mem_pdev->dev, size, &handle, gfp);
+}
+
+/* Free given decrypted memory */
+void cc_decrypted_free(void *addr, size_t size)
+{
+       if (!mem_pdev)
+               return;
+
+       dma_free_coherent(&mem_pdev->dev, size, addr, handle);
+}
+
+static int cc_mem_probe(struct platform_device *pdev)
+{
+       if (dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64)))
+               return -EIO;
+
+       mem_pdev = pdev;
+
+       return 0;
+}
+
+static struct platform_driver cc_mem_driver = {
+       .probe  = cc_mem_probe,
+       .driver = {
+               .name = CC_MEM_DRIVER,
+       },
+};
+
+static int __init cc_mem_init(void)
+{
+       struct platform_device *pdev;
+       int ret;
+
+       if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+               return -ENODEV;
+
+       ret =  platform_driver_register(&cc_mem_driver);
+       if (ret)
+               return ret;
+
+       pdev = platform_device_register_simple(CC_MEM_DRIVER, -1, NULL, 0);
+       if (IS_ERR(pdev)) {
+               platform_driver_unregister(&cc_mem_driver);
+               return PTR_ERR(pdev);
+       }
+
+       return 0;
+}
+device_initcall(cc_mem_init);
diff --git a/arch/x86/include/asm/coco.h b/arch/x86/include/asm/coco.h
index 3d98c3a60d34..9d616d9e3405 100644
--- a/arch/x86/include/asm/coco.h
+++ b/arch/x86/include/asm/coco.h
@@ -17,6 +17,8 @@ void cc_set_mask(u64 mask);
 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM
 u64 cc_mkenc(u64 val);
 u64 cc_mkdec(u64 val);
+void *cc_decrypted_alloc(size_t size, gfp_t gfp);
+void cc_decrypted_free(void *addr, size_t size);
 #else
 static inline u64 cc_mkenc(u64 val)
 {
@@ -27,6 +29,10 @@ static inline u64 cc_mkdec(u64 val)
 {
        return val;
 }
+
+static inline void *cc_decrypted_alloc(size_t size, gfp_t gfp) { return NULL; }
+static inline void cc_decrypted_free(void *addr, size_t size) { }
+
 #endif
 
 #endif /* _ASM_X86_COCO_H */




-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-20 14:56                     ` Sathyanarayanan Kuppuswamy
@ 2022-07-20 16:17                       ` Kirill A. Shutemov
  2022-07-20 16:58                         ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Kirill A. Shutemov @ 2022-07-20 16:17 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On Wed, Jul 20, 2022 at 07:56:04AM -0700, Sathyanarayanan Kuppuswamy wrote:
> diff --git a/arch/x86/coco/mem.c b/arch/x86/coco/mem.c
> new file mode 100644
> index 000000000000..ef76a8accc1e
> --- /dev/null
> +++ b/arch/x86/coco/mem.c
> @@ -0,0 +1,82 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Confidential Computing Decrypted Memory Allocator
> + *
> + * Copyright (C) 2022 Intel Corporation, Inc.
> + *
> + */
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt)     "cc/mem: " fmt
> +
> +#include <linux/export.h>
> +#include <linux/mm.h>
> +#include <linux/cc_platform.h>
> +#include <linux/set_memory.h>
> +#include <linux/platform_device.h>
> +#include <linux/dma-mapping.h>
> +
> +#include <asm/coco.h>
> +#include <asm/processor.h>
> +
> +#define CC_MEM_DRIVER          "ccmem"
> +
> +struct platform_device *mem_pdev;

'static'?

> +dma_addr_t handle;

Hm. How does it work with >1 allocation a time?

> +
> +/* Allocate decrypted memory of given size */
> +void *cc_decrypted_alloc(size_t size, gfp_t gfp)
> +{
> +       if (!mem_pdev)
> +               return NULL;
> +
> +       return dma_alloc_coherent(&mem_pdev->dev, size, &handle, gfp);
> +}
> +
> +/* Free given decrypted memory */
> +void cc_decrypted_free(void *addr, size_t size)
> +{
> +       if (!mem_pdev)
> +               return;
> +
> +       dma_free_coherent(&mem_pdev->dev, size, addr, handle);
> +}
> +

-- 
 Kirill A. Shutemov


* Re: [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions
  2022-07-20 16:17                       ` Kirill A. Shutemov
@ 2022-07-20 16:58                         ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-20 16:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Kai Huang, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel



On 7/20/22 9:17 AM, Kirill A. Shutemov wrote:
> On Wed, Jul 20, 2022 at 07:56:04AM -0700, Sathyanarayanan Kuppuswamy wrote:
>> diff --git a/arch/x86/coco/mem.c b/arch/x86/coco/mem.c
>> new file mode 100644
>> index 000000000000..ef76a8accc1e
>> --- /dev/null
>> +++ b/arch/x86/coco/mem.c
>> @@ -0,0 +1,82 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Confidential Computing Decrypted Memory Allocator
>> + *
>> + * Copyright (C) 2022 Intel Corporation, Inc.
>> + *
>> + */
>> +
>> +#undef pr_fmt
>> +#define pr_fmt(fmt)     "cc/mem: " fmt
>> +
>> +#include <linux/export.h>
>> +#include <linux/mm.h>
>> +#include <linux/cc_platform.h>
>> +#include <linux/set_memory.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/dma-mapping.h>
>> +
>> +#include <asm/coco.h>
>> +#include <asm/processor.h>
>> +
>> +#define CC_MEM_DRIVER          "ccmem"
>> +
>> +struct platform_device *mem_pdev;
> 
> 'static'?

Agree.

> 
>> +dma_addr_t handle;
> 
> Hm. How does it work with >1 allocation a time?

It is a bug. I think we should make it a parameter of the alloc/free APIs.
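
A minimal userspace sketch of that change, assuming the handle is returned
to the caller at allocation time and passed back on free. The model_* and
*_model names and dma_addr_model_t are hypothetical stand-ins for the DMA
API, not kernel interfaces:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's dma_addr_t. */
typedef uint64_t dma_addr_model_t;

/* Stand-in for dma_alloc_coherent(): returns memory plus a per-buffer handle. */
static void *model_dma_alloc(size_t size, dma_addr_model_t *handle)
{
	void *vaddr = calloc(1, size);

	if (vaddr)
		*handle = (dma_addr_model_t)(uintptr_t)vaddr;
	return vaddr;
}

/* Stand-in for dma_free_coherent(): rejects a handle that does not match. */
static int model_dma_free(void *vaddr, dma_addr_model_t handle)
{
	if (handle != (dma_addr_model_t)(uintptr_t)vaddr)
		return -1;
	free(vaddr);
	return 0;
}

/* The handle is an out-parameter, so each allocation keeps its own. */
static void *cc_decrypted_alloc_model(size_t size, dma_addr_model_t *handle)
{
	return model_dma_alloc(size, handle);
}

/* The caller passes back the handle it received at allocation time. */
static int cc_decrypted_free_model(void *addr, dma_addr_model_t handle)
{
	return model_dma_free(addr, handle);
}
```

With a single global handle, a second allocation overwrites the handle of
the first, so freeing the first buffer would pass the wrong handle. In this
model that mismatch is the case that returns -1.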

> 
>> +
>> +/* Allocate decrypted memory of given size */
>> +void *cc_decrypted_alloc(size_t size, gfp_t gfp)
>> +{
>> +       if (!mem_pdev)
>> +               return NULL;
>> +
>> +       return dma_alloc_coherent(&mem_pdev->dev, size, &handle, gfp);
>> +}
>> +
>> +/* Free given decrypted memory */
>> +void cc_decrypted_free(void *addr, size_t size)
>> +{
>> +       if (!mem_pdev)
>> +               return;
>> +
>> +       dma_free_coherent(&mem_pdev->dev, size, addr, handle);
>> +}
>> +
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2022-06-14 12:30   ` Wander Lairson Costa
@ 2022-07-21 16:08   ` Dave Hansen
  2022-07-21 16:42     ` Sathyanarayanan Kuppuswamy
  2022-07-21 23:32     ` Kai Huang
  2022-07-22  0:27   ` Dave Hansen
  2 siblings, 2 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 16:08 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> For shared buffer allocation, alternatives like using the DMA API is
> also considered. Although it simpler to use, it is not preferred because
> dma_alloc_*() APIs require a valid bus device as argument, which would
> need converting the attestation driver into a platform device driver.
> This is unnecessary, and since the attestation driver does not do real
> DMA, there is no need to use real DMA APIs.

Let's actually try to walk through the requirements for the memory
allocation here.

 1. The guest kernel needs to allocate some guest physical memory
    for the attestation data buffer
 2. The guest physical memory must be mapped by the guest so that
    it can be read/written.
 3. The guest mapping must be a "TDX Shared" mapping.  Since all
    guest physical memory is "TDX Private" by default, something
    must convert the memory from Private->Shared.
 4. If there are alias mappings with "TDX Private" page table
    permissions, those mappings must never be used while the page is
    in its shared state.
    4a. load_unaligned_zeropad() must be prevented from being used
	on the page immediately preceding a Private alias to a Shared
 	page.
 5. Actions that increasingly fracture the direct map must be avoided.
    Attestation may happen many times and repeated allocations that
    fracture the direct map have performance consequences.
 6. A softer requirement: presuming that bounce buffers won't be used
    for TDX devices *forever*, it would be nice to use a mechanism that
    will continue to work on systems that don't have swiotlb on.

I think we've talked about three different solutions:

== vmalloc() ==

So, let's say we used a relatively plain vmalloc().  That's great for
#1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
properly on its PTEs.  But, it falls over for *either* #4 or #5.  If it
leaves the direct map alone, it's exposed to load_unaligned_zeropad().
If it unmaps the memory from the direct map, it runs afoul of #5.

== order-1 + vmap() ==

Let's now consider a vmalloc() variant: allocate a bunch of order-1
pages and vmap() page[1], leaving page[0] as a guard page against
load_unaligned_zeropad() on the direct map.  That works, but it's an
annoying amount of code.
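
The hazard behind requirement 4a, and the reason a guard page helps, can be
modelled in userspace as below. This is only a sketch: the 4KB page size, the
8-byte word load and all model_* names are assumptions made for illustration,
not kernel code.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u	/* assumed 4KB pages */
#define MODEL_WORD_SIZE 8u	/* assumed 64-bit word load */

/* Index of the page that contains the given address. */
static uint64_t model_page_of(uint64_t addr)
{
	return addr / MODEL_PAGE_SIZE;
}

/* Returns 1 if a word-sized load starting at addr touches the next page. */
static int model_load_spills(uint64_t addr)
{
	return model_page_of(addr) != model_page_of(addr + MODEL_WORD_SIZE - 1);
}

/*
 * Returns 1 if a word load starting somewhere in page 'src' could touch
 * page 'shared'. Only the immediately preceding page is at risk.
 */
static int model_page_at_risk(uint64_t src, uint64_t shared)
{
	return src + 1 == shared;
}
```

In the order-1 scheme, page[0] is the only page at risk for page[1]. Since
page[0] belongs to the same allocation and holds no data, no word load starts
there.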

== swiotlb pages ==

Using the swiotlb bounce buffer pages is the other proposed option.
They already have a working kernel mapping and have already been
converted.  They are mitigated against load_unaligned_zeropad().  They
do cause direct map fracturing, but only once since they're allocated
statically.  They don't increasingly degrade things.  It's a one-time
cost.  Their interaction with #6 is not great.

Did I miss anything?  Does that accurately capture where we are?


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 16:08   ` Dave Hansen
@ 2022-07-21 16:42     ` Sathyanarayanan Kuppuswamy
  2022-07-21 16:49       ` Dave Hansen
  2022-07-21 23:32     ` Kai Huang
  1 sibling, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 16:42 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/21/22 9:08 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> For shared buffer allocation, alternatives like using the DMA API is
>> also considered. Although it simpler to use, it is not preferred because
>> dma_alloc_*() APIs require a valid bus device as argument, which would
>> need converting the attestation driver into a platform device driver.
>> This is unnecessary, and since the attestation driver does not do real
>> DMA, there is no need to use real DMA APIs.
> 
> Let's actually try to walk through the requirements for the memory
> allocation here.
> 
>  1. The guest kernel needs to allocate some guest physical memory
>     for the attestation data buffer

Physically contiguous memory.

>  2. The guest physical memory must be mapped by the guest so that
>     it can be read/written.
>  3. The guest mapping must be a "TDX Shared" mapping.  Since all
>     guest physical memory is "TDX Private" by default, something
>     must convert the memory from Private->Shared.
>  4. If there are alias mappings with "TDX Private" page table
>     permissions, those mappings must never be used while the page is
>     in its shared state.
>     4a. load_unaligned_zeropad() must be prevented from being used
> 	on the page immediately preceding a Private alias to a Shared
>  	page.
>  5. Actions that increasingly fracture the direct map must be avoided.
>     Attestation may happen many times and repeated allocations that
>     fracture the direct map have performance consequences.
>  6. A softer requirement: presuming that bounce buffers won't be used
>     for TDX devices *forever*, it would be nice to use a mechanism that
>     will continue to work on systems that don't have swiotlb on.
>

Apart from the above correction, the requirements are correct.
 
> I think we've talked about three different solutions:
> 
> == vmalloc() ==
> 
> So, let's say we used a relatively plain vmalloc().  That's great for
> #1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
> properly on its PTEs.  But, it falls over for *either* #4 or #5.  If it
> leaves the direct map alone, it's exposed to load_unaligned_zeropad().
> If it unmaps the memory from the direct map, it runs afoul of #5.

Since we need physically contiguous memory, vmalloc is not preferred.

> 
> == order-1 + vmap() ==
> 
> Let's now consider a vmalloc() variant: allocate a bunch of order-1
> pages and vmap() page[1], leaving page[0] as a guard page against
> load_unaligned_zeropad() on the direct map.  That works, but it's an
> annoying amount of code.
> 
> == swiotlb pages ==
> 
> Using the swiotlb bounce buffer pages is the other proposed option.
> They already have a working kernel mapping and have already been
> converted.  They are mitigated against load_unaligned_zeropad().  They
> do cause direct map fracturing, but only once since they're allocated
> statically.  They don't increasingly degrade things.  It's a one-time
> cost.  Their interaction with #6 is not great.
> 
> Did I miss anything?  Does that accurately capture where we are?



-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 16:42     ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 16:49       ` Dave Hansen
  2022-07-21 16:54         ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 16:49 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
> On 7/21/22 9:08 AM, Dave Hansen wrote:
>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>> For shared buffer allocation, alternatives like using the DMA API is
>>> also considered. Although it simpler to use, it is not preferred because
>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>> need converting the attestation driver into a platform device driver.
>>> This is unnecessary, and since the attestation driver does not do real
>>> DMA, there is no need to use real DMA APIs.
>> Let's actually try to walk through the requirements for the memory
>> allocation here.
>>
>>  1. The guest kernel needs to allocate some guest physical memory
>>     for the attestation data buffer
> Physically contiguous memory.

Remind me how large the quote structures are.



* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 16:49       ` Dave Hansen
@ 2022-07-21 16:54         ` Sathyanarayanan Kuppuswamy
  2022-07-21 17:02           ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 16:54 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel



On 7/21/22 9:49 AM, Dave Hansen wrote:
> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>> also considered. Although it simpler to use, it is not preferred because
>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>> need converting the attestation driver into a platform device driver.
>>>> This is unnecessary, and since the attestation driver does not do real
>>>> DMA, there is no need to use real DMA APIs.
>>> Let's actually try to walk through the requirements for the memory
>>> allocation here.
>>>
>>>  1. The guest kernel needs to allocate some guest physical memory
>>>     for the attestation data buffer
>> Physically contiguous memory.
> 
> Remind me how large the quote structures are.

It depends on the attestation service. In addition to the TDREPORT (1K in
size), additional data can be included in the signed quote during quote
generation, so the spec allows a variable length. The user agent communicates
with the attestation service to determine the appropriate buffer length. Our
tests use 8K buffers.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 16:54         ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 17:02           ` Dave Hansen
  2022-07-21 17:16             ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 17:02 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 7/21/22 09:54, Sathyanarayanan Kuppuswamy wrote:
> 
> On 7/21/22 9:49 AM, Dave Hansen wrote:
>> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>>> also considered. Although it simpler to use, it is not preferred because
>>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>>> need converting the attestation driver into a platform device driver.
>>>>> This is unnecessary, and since the attestation driver does not do real
>>>>> DMA, there is no need to use real DMA APIs.
>>>> Let's actually try to walk through the requirements for the memory
>>>> allocation here.
>>>>
>>>>  1. The guest kernel needs to allocate some guest physical memory
>>>>     for the attestation data buffer
>>> Physically contiguous memory.
>> Remind me how large the quote structures are.
> It depends on the attestation service. In addition to TDREPORT (1K size),
> during quote generation, additional data can be included in the signed
> quote. So the spec allows variable length. User agent will communicate with
> attestation service to identify the appropriate buffer length. Our test
> uses 8K buffers

What is this "additional data"?  Is that "REPORTDATA" from the TDX
module spec?

	"Additional REPORTDATA, a 64-byte value, is provided by the
	guest TD to be included in the TDG.MR.REPORT."




* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 17:02           ` Dave Hansen
@ 2022-07-21 17:16             ` Sathyanarayanan Kuppuswamy
  2022-07-21 17:19               ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 17:16 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/21/22 10:02 AM, Dave Hansen wrote:
> On 7/21/22 09:54, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 7/21/22 9:49 AM, Dave Hansen wrote:
>>> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>>>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>>>> also considered. Although it simpler to use, it is not preferred because
>>>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>>>> need converting the attestation driver into a platform device driver.
>>>>>> This is unnecessary, and since the attestation driver does not do real
>>>>>> DMA, there is no need to use real DMA APIs.
>>>>> Let's actually try to walk through the requirements for the memory
>>>>> allocation here.
>>>>>
>>>>>  1. The guest kernel needs to allocate some guest physical memory
>>>>>     for the attestation data buffer
>>>> Physically contiguous memory.
>>> Remind me how large the quote structures are.
>> It depends on the attestation service. In addition to TDREPORT (1K size),
>> during quote generation, additional data can be included in the signed
>> quote. So the spec allows variable length. User agent will communicate with
>> attestation service to identify the appropriate buffer length. Our test
>> uses 8K buffers
> 
> What is this "additional data"?  Is that "REPORTDATA" from the TDX
> module spec?
> 
> 	"Additional REPORTDATA, a 64-byte value, is provided by the
> 	guest TD to be included in the TDG.MR.REPORT."


It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
the REPORT generation process (using TDG.MR.REPORT).

The spec does not clearly define the additional data. I think it is related
to certificates.

Isaku, do you have more details about it?

The current ABI allows the attestation service and agent to decide the quote
size, so we can't make assumptions about what that size will be.

The following is copied from the TDX Module specification, section titled
"Measurement and Attestation":

An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
to sign a quote containing the guest TD’s measurements and the additional data being
quoted.

> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 17:16             ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 17:19               ` Dave Hansen
  2022-07-21 18:31                 ` Sathyanarayanan Kuppuswamy
  2022-07-21 18:42                 ` Isaku Yamahata
  0 siblings, 2 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 17:19 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 7/21/22 10:16, Sathyanarayanan Kuppuswamy wrote:
>> What is this "additional data"?  Is that "REPORTDATA" from the TDX
>> module spec?
>>
>> 	"Additional REPORTDATA, a 64-byte value, is provided by the
>> 	guest TD to be included in the TDG.MR.REPORT."
> 
> It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
> REPORT generation process (using TDG.MR.REPORT).
> 
> Spec does not clearly define the details about the additional data. I think it is
> related to certificates.
> 
> Isaku, do you have more details about it?
> 
> Current ABI allows attestation service and agent to decide the quote size. So
> we can't make assumptions on what that size will be.
> 
> Following is copied from TDX Module specification, sec titled "Measurement
> and Attestation"
> 
> An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
> uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
> TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
> to sign a quote containing the guest TD’s measurements and the additional data being
> quoted.

<sigh> We're off in the weeds again.

How many bytes does the TDX module read from and write to the guest
physical address space?  What are the absolute limits?  What is the
minimum and the maximum that the kernel needs to handle?


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 17:19               ` Dave Hansen
@ 2022-07-21 18:31                 ` Sathyanarayanan Kuppuswamy
  2022-07-21 18:42                 ` Isaku Yamahata
  1 sibling, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 18:31 UTC (permalink / raw)
  To: Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/21/22 10:19 AM, Dave Hansen wrote:
> How many bytes does the TDX module read from and write to the guest
> physical address space?  What are the absolute limits?  What is the
> minimum and the maximum that the kernel needs to handle?

The minimum is 1K (equal to the TDREPORT size; on input, the Quoting Enclave
reads the 1K TDREPORT data from the GPA).

The maximum size is not defined in the spec. I think the spec leaves the quote
size unlimited for future compatibility. For most cases, I think 8K should be
big enough.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 17:19               ` Dave Hansen
  2022-07-21 18:31                 ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 18:42                 ` Isaku Yamahata
  2022-07-21 18:52                   ` Dave Hansen
  1 sibling, 1 reply; 77+ messages in thread
From: Isaku Yamahata @ 2022-07-21 18:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Kai Huang,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, Jul 21, 2022 at 10:19:30AM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:

> On 7/21/22 10:16, Sathyanarayanan Kuppuswamy wrote:
> >> What is this "additional data"?  Is that "REPORTDATA" from the TDX
> >> module spec?
> >>
> >> 	"Additional REPORTDATA, a 64-byte value, is provided by the
> >> 	guest TD to be included in the TDG.MR.REPORT."
> > 
> > It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
> > REPORT generation process (using TDG.MR.REPORT).
> > 
> > Spec does not clearly define the details about the additional data. I think it is
> > related to certificates.
> > 
> > Isaku, do you have more details about it?
> > 
> > Current ABI allows attestation service and agent to decide the quote size. So
> > we can't make assumptions on what that size will be.
> > 
> > Following is copied from TDX Module specification, sec titled "Measurement
> > and Attestation"
> > 
> > An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
> > uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
> > TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
> > to sign a quote containing the guest TD’s measurements and the additional data being
> > quoted.
> 
> <sigh> We're off in the weeds again.
> 
> How many bytes does the TDX module read from and write to the guest
> physical address space?  What are the absolute limits?  What is the
> minimum and the maximum that the kernel needs to handle?

It's the VMM (more precisely, the user space VMM, like qemu in the KVM case)
that handles the GetQuote request and reads/writes the shared pages.
With KVM, there is no hard limit on the user space VMM's access to guest memory.

Regarding the size of the quote, there is no hard limit because it is left
open for future use.  But here is a rough idea about it.

- 4KB (1 page) can accommodate the TDREPORT that is defined right now.
- 128KB is too small for foreseen use cases.
- 1MB is too big. Big enough for the future.

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 18:42                 ` Isaku Yamahata
@ 2022-07-21 18:52                   ` Dave Hansen
  2022-07-21 18:57                     ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 18:52 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Kai Huang,
	Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 7/21/22 11:42, Isaku Yamahata wrote:
> Regarding to the size of quote, there is no hard limit because it's for future
> usage.  But here is a rough idea about it.
> 
> - 4KB(1 page) can accommodate TDREPORT that is defined right now.
> - 128KB is too small for foreseen use cases.
> - 1MB is too big. big enough for future.

Again, we're off in the weeds.

How does the VMM know how much to read/write?  I have a theory: the spec
says that R12 is:

	"Shared 4KB GPA as input – the memory contains a
	TDREPORT_STRUCT."

That's *A* 4KB GPA.  The maximum is one 4KB page.  That's the only thing
that makes sense because there's no length in the ABI anywhere.

What am I missing?


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 18:52                   ` Dave Hansen
@ 2022-07-21 18:57                     ` Sathyanarayanan Kuppuswamy
  2022-07-21 19:23                       ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 18:57 UTC (permalink / raw)
  To: Dave Hansen, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 7/21/22 11:52 AM, Dave Hansen wrote:
> On 7/21/22 11:42, Isaku Yamahata wrote:
>> Regarding to the size of quote, there is no hard limit because it's for future
>> usage.  But here is a rough idea about it.
>>
>> - 4KB(1 page) can accommodate TDREPORT that is defined right now.
>> - 128KB is too small for foreseen use cases.
>> - 1MB is too big. big enough for future.
> 
> Again, we're off in the weeds.
> 
> How does the VMM know how much to read/write?  I have a theory: the spec
> says that R12 is:
> 
> 	"Shared 4KB GPA as input – the memory contains a
> 	TDREPORT_STRUCT."
> 
> That's *A* 4KB GPA.  The maximum is one 4KB page.  That's the only thing
> that makes sense because there's no length in the ABI anywhere.
> 
> What am I missing?

I think you are looking at an old spec. Please check the version dated
"FEBRUARY 2022".

The following are the ABI details:

R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
       same buffer is used as output – the memory contains a TD Quote.
R13 -  Size of shared GPA. The size must be 4KB-aligned.
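
A small sketch of how a caller might size the buffer for R13, assuming only
the 4KB alignment rule quoted above. The names getquote_buf_size() and
getquote_size_valid() are hypothetical, and treating a zero size as invalid
is an assumption, not something the quoted ABI text states:

```c
#include <stddef.h>

#define GETQUOTE_ALIGN 4096u	/* R13 must be 4KB-aligned per the quoted ABI */

/* Round a requested quote length up to the next 4KB multiple. */
static size_t getquote_buf_size(size_t len)
{
	return (len + GETQUOTE_ALIGN - 1) & ~((size_t)GETQUOTE_ALIGN - 1);
}

/* Returns 1 if size is a non-zero 4KB multiple (zero is rejected by assumption). */
static int getquote_size_valid(size_t size)
{
	return size != 0 && (size % GETQUOTE_ALIGN) == 0;
}
```

Under this rule a 1K TDREPORT still needs a full 4KB buffer, and the 8K test
buffer is already aligned.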

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 18:57                     ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 19:23                       ` Dave Hansen
  2022-07-21 22:08                         ` Sathyanarayanan Kuppuswamy
  2022-07-21 23:16                         ` Kai Huang
  0 siblings, 2 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-21 19:23 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
>> How does the VMM know how much to read/write?  I have a theory: the spec
>> says that R12 is:
>>
>> 	"Shared 4KB GPA as input – the memory contains a
>> 	TDREPORT_STRUCT."
>>
>> That's *A* 4KB GPA.  The maximum is one 4KB page.  That's the only thing
>> that makes sense because there's no length in the ABI anywhere.
>>
>> What am I missing?
> I think you are looking into the old spec. Please check the version
> "FEBRUARY 2022"
> 
> Following are the ABI details:
> 
> R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
> R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
>        same buffer is used as output – the memory contains a TD Quote.
> R13 -  Size of shared GPA. The size must be 4KB-aligned.

Yeah, silly me.  I assumed the ABI was stable and wouldn't be, you know,
adding and removing parameters.

I still don't know how this all works.  You just said:

> Current ABI allows attestation service and agent to decide the quote size. So
> we can't make assumptions on what that size will be.

But, the guest *HAS* to make assumptions, right?  It's allocating the
buffer and handing a pointer and size over to the host.  It's also guest
*userspace*.  In fact, this implementation *ABSOLUTELY* makes
assumptions about the buffer size.

If host userspace some day decides it needs 5MB of space, then all the
guests will just stop working.  This implementation is limited by the
max page allocator size.

This all just seems to work by chance.
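
To put a number on that limit, here is a rough userspace sketch of the
order calculation. It assumes 4KB pages and a MAX_ORDER of 11 (which I
believe is the x86-64 default around v5.19, so the largest single
allocation is order 10, i.e. 4MB). The helper names are made up; the
first one only mirrors what the kernel's get_order() computes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB pages */
#define SKETCH_MAX_ORDER  11	/* assumption: default around v5.19 */

/* Smallest order such that (1 << order) pages cover size bytes. */
static unsigned int sketch_get_order(size_t size)
{
	size_t page = (size_t)1 << SKETCH_PAGE_SHIFT;
	size_t pages = (size + page - 1) >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	assert(size > 0);
	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Nonzero if a physically contiguous buffer of this size fits the limit. */
static int sketch_can_allocate(size_t size)
{
	return size != 0 && sketch_get_order(size) < SKETCH_MAX_ORDER;
}
```

Under those assumptions a 16KB request is order 2 and a 4MB request is
order 10, both fine. A 5MB request needs order 11 and fails.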

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 19:23                       ` Dave Hansen
@ 2022-07-21 22:08                         ` Sathyanarayanan Kuppuswamy
  2022-07-21 23:16                         ` Kai Huang
  1 sibling, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-21 22:08 UTC (permalink / raw)
  To: Dave Hansen, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 7/21/22 12:23 PM, Dave Hansen wrote:
> On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
>>> How does the VMM know how much to read/write?  I have a theory: the spec
>>> says that R12 is:
>>>
>>> 	"Shared 4KB GPA as input – the memory contains a
>>> 	TDREPORT_STRUCT."
>>>
>>> That's *A* 4KB GPA.  The maximum is one 4KB page.  That's the only thing
>>> that makes sense because there's no length in the ABI anywhere.
>>>
>>> What am I missing?
>> I think you are looking into the old spec. Please check the version
>> "FEBRUARY 2022"
>>
>> Following are the ABI details:
>>
>> R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
>> R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
>>        same buffer is used as output – the memory contains a TD Quote.
>> R13 -  Size of shared GPA. The size must be 4KB-aligned.
> 
> Yeah, silly me.  I assumed the ABI was stable and wouldn't be, you know,
> adding and removing parameters.
> 
> I still don't know how this all works.  You just said:

At a high level, the quote generation communication flow is as below:

Entities involved are,

Guest TD user app (send TDREPORT)  <-> Quoting Enclave (QE) (Sign TDREPORT and send Quote)

Steps involved are,

1. Attestation agent (in TD userspace)  will get the TDREPORT data from
   the TDX module.
2. To generate a remotely verifiable Quote, send the TDREPORT data to the
   Quoting enclave (QE). QE will verify the integrity of TDREPORT and sign
   it with the attestation key.
   * QE can be hosted as a simple app in the host, or it can be hosted in a
     special TD guest.
3. Method of sending TDREPORT to QE depends on the QE requirements. In most
   cases, it will support TCP/IP or vsock communication models. So the attestation
   agent can easily send the TDREPORT via TCP/IP and get the Quote data.

But for platforms that do not want to support TCP/IP or socket communication,
TDX ABI defines a method of getting Quote using TDVMCALL. It is a less common
case. Our current discussion is related to this approach.

Entities involved in this approach are,

TD Guest user app <-> TD Guest kernel <-> VMM  <-> QE

Communication flow looks like below:

1. Attestation agent (in TD userspace)  will get the TDREPORT data from
   the TDX module.
2. Check with QE about the required quote size (it can be some form of
   agreement between QE and attestation agent).
3. Allocate space for the Quote buffer, fill in the header fields as
   shown below, and pass the Quote buffer and its length to the kernel
   via an IOCTL:

+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+       /* Quote version, filled by TD */
+       __u64 version;
+       /* Status code of Quote request, filled by VMM */
+       __u64 status;
+       /* Length of TDREPORT, filled by TD */
+       __u32 in_len;
+       /* Length of Quote, filled by VMM */
+       __u32 out_len;
+       /* Actual Quote data or TDREPORT on input */
+       __u64 data[0];
+};
+
+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf         : Pass user data that includes TDREPORT as input. Upon
+ *                successful completion of IOCTL, output is copied
+ *                back to the same buffer.
+ * @len         : Length of the Quote buffer.
+ */
+struct tdx_quote_req {
+       __u64 buf;
+       __u64 len;
+};

4. Once an IOCTL request is received, the kernel will allocate a Quote buffer
   of the requested size, and use GetQuote hypercall to send this request to VMM. 

   GetQuote (TDVMCALL type, kernel buffer physical addr, length of the buffer).
   
5. Upon hypercall request, VMM will further send the details to the QE. Once
   QE processes this request and generates the Quote, it will update the status
   details in the Quote header and copy the Quote back to the guest GPA. After
   completing the request, it will also send the IRQ notification to the TD
   Guest kernel. Event notification IRQ requirement is due to this step.
   
6. Once TD Guest kernel receives event notification, it will copy the 
   contents of the Quote back to the user buffer and complete the IOCTL
   request.
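
Steps 2 and 3 on the userspace side could look roughly like the sketch
below. It is only an illustration: the struct repeats the header layout
from above with stdint types, the helper names are hypothetical, the
version value of 1 is an assumption, and the expected Quote size is
whatever the agent and QE agreed on. The total length is rounded up to
4KB because the hypercall requires a 4KB-aligned size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_QUOTE_ALIGN 4096u

/* Same layout as struct tdx_quote_hdr, using userspace types. */
struct quote_hdr {
	uint64_t version;	/* filled by TD */
	uint64_t status;	/* filled by VMM */
	uint32_t in_len;	/* TDREPORT length, filled by TD */
	uint32_t out_len;	/* Quote length, filled by VMM */
	uint64_t data[];	/* TDREPORT on input, Quote on output */
};

/* Header plus the larger of input/output payload, rounded up to 4KB. */
static size_t quote_buf_len(size_t report_len, size_t max_quote_len)
{
	size_t payload = report_len > max_quote_len ? report_len : max_quote_len;
	size_t total = sizeof(struct quote_hdr) + payload;

	return (total + SKETCH_QUOTE_ALIGN - 1) / SKETCH_QUOTE_ALIGN *
	       SKETCH_QUOTE_ALIGN;
}

/* Allocate a zeroed buffer and place the TDREPORT in it; caller frees. */
static struct quote_hdr *quote_buf_alloc(const void *report,
					 uint32_t report_len,
					 size_t max_quote_len, size_t *len_out)
{
	size_t len = quote_buf_len(report_len, max_quote_len);
	struct quote_hdr *hdr;

	assert(report != NULL && len_out != NULL);
	hdr = calloc(1, len);
	if (!hdr)
		return NULL;
	hdr->version = 1;	/* assumption: version expected by the VMM */
	hdr->in_len = report_len;
	memcpy(hdr->data, report, report_len);
	*len_out = len;
	return hdr;
}
```

The returned pointer and length are what would go into the buf and len
members of struct tdx_quote_req. For a 1024-byte report and an agreed
maximum Quote size of 8KB, the buffer comes out at 12KB.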

> 
>> Current ABI allows attestation service and agent to decide the quote size. So
>> we can't make assumptions on what that size will be.
> 
> But, the guest *HAS* to make assumptions, right?  It's allocating the
> buffer and handing a pointer and size over to the host.  It's also guest
> *userspace*.  In fact, this implementation *ABSOLUTELY* makes
> assumptions about the buffer size.
> 
> If host userspace some day decides it needs 5MB of space, then all the
> guests will just stop working.  This implementation is limited by the
> max page allocator size.

Agreed. We are assuming that the user agent will not request a buffer that
exceeds the limit of the page allocator. But as Isaku mentioned, the Quote
size is expected to be in the KB range. I think the main reason the ABI
does not limit the size is flexibility, to support future requirements.

Also, as mentioned above, this approach is not a common case. TCP/IP or
vsock models are generally supported and can be used for Quote generation.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 19:23                       ` Dave Hansen
  2022-07-21 22:08                         ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 23:16                         ` Kai Huang
  1 sibling, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-07-21 23:16 UTC (permalink / raw)
  To: Dave Hansen, Sathyanarayanan Kuppuswamy, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-07-21 at 12:23 -0700, Dave Hansen wrote:
> On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
> > > How does the VMM know how much to read/write?  I have a theory: the spec
> > > says that R12 is:
> > > 
> > > 	"Shared 4KB GPA as input – the memory contains a
> > > 	TDREPORT_STRUCT."
> > > 
> > > That's *A* 4KB GPA.  The maximum is one 4KB page.  That's the only thing
> > > that makes sense because there's no length in the ABI anywhere.
> > > 
> > > What am I missing?
> > I think you are looking into the old spec. Please check the version
> > "FEBRUARY 2022"
> > 
> > Following are the ABI details:
> > 
> > R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
> > R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
> >        same buffer is used as output – the memory contains a TD Quote.
> > R13 -  Size of shared GPA. The size must be 4KB-aligned.
> 
> Yeah, silly me.  I assumed the ABI was stable and wouldn't be, you know,
> adding and removing parameters.
> 
> I still don't know how this all works.  You just said:
> 
> > Current ABI allows attestation service and agent to decide the quote size. So
> > we can't make assumptions on what that size will be.
> 
> But, the guest *HAS* to make assumptions, right?  It's allocating the
> buffer and handing a pointer and size over to the host.  It's also guest
> *userspace*.  In fact, this implementation *ABSOLUTELY* makes
> assumptions about the buffer size.
> 
> If host userspace some day decides it needs 5MB of space, then all the
> guests will just stop working.  This implementation is limited by the
> max page allocator size.
> 
> This all just seems to work by chance.

Intel's Quote format is pretty much defined in a spec.  I don't know
whether the spec is public or not, though.  But a Quote by definition has Intel's
certificates embedded, so the size is indeed variable, even though in practice
it can be treated as fixed since Intel's certificates don't change often.

Intel's QGS (Quote Generation Service) once had an API to query the Quote size,
but it got removed somehow, and instead Intel used a hard-coded buffer (8K or
16K, I forget), which is big enough for now.

Also, theoretically, a 3rd party can add more stuff (i.e. their own certificates)
when they deploy their own attestation services, so the Quote size can even be
3rd party dependent.

So in short, yes, the Quote size is variable and it is determined by userspace.
But in reality, 16K is big enough for the foreseeable future.

When using vsock, userspace just uses the standard socket API to read data from
QGS.  It has all the flexibility: for instance, it can read the header first
(which is a couple of bytes and fixed size) and decode the exact Quote size.  Then
it can allocate a large enough buffer and read the rest in one go.  How the
kernel's vsock implementation sizes its shared buffers is internal to vsock and
transparent to userspace.
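
A minimal sketch of that "header first, then body" pattern on a stream
socket follows. The wire format here, a 4-byte big-endian length prefix,
is purely hypothetical and is not the real QGS protocol. The function
names and the max_len cap are also my own additions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Read exactly len bytes, retrying short reads; 0 on success, -1 on error/EOF. */
static int read_full(int fd, void *buf, size_t len)
{
	unsigned char *p = buf;

	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Read a hypothetical 4-byte big-endian length, then allocate and read the
 * body. Returns NULL on error or if the peer claims more than max_len bytes.
 */
static unsigned char *read_sized_msg(int fd, size_t max_len, size_t *len_out)
{
	unsigned char hdr[4];
	unsigned char *body;
	uint32_t len;

	assert(len_out != NULL);
	if (read_full(fd, hdr, sizeof(hdr)) < 0)
		return NULL;
	len = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
	      ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
	if (len > max_len)
		return NULL;
	body = malloc(len ? len : 1);
	if (!body)
		return NULL;
	if (read_full(fd, body, len) < 0) {
		free(body);
		return NULL;
	}
	*len_out = len;
	return body;
}
```

The point is only that the reader learns the exact size before it
allocates, so no up-front guess about the Quote size is needed.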

But with the GetQuote TDVMCALL we don't have that luxury.  Userspace needs to
hand a big enough buffer to the kernel, since GetQuote must receive the entire
Quote at once.

That being said, ideally, what we need is a TDVMCALL-based communication
channel, instead of a bunch of TDVMCALLs each associated with one specific
operation (i.e. GetQuote).  But obviously this isn't the direction we are
heading.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-21 16:08   ` Dave Hansen
  2022-07-21 16:42     ` Sathyanarayanan Kuppuswamy
@ 2022-07-21 23:32     ` Kai Huang
  1 sibling, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-07-21 23:32 UTC (permalink / raw)
  To: Dave Hansen, Kuppuswamy Sathyanarayanan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-07-21 at 09:08 -0700, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> > For shared buffer allocation, alternatives like using the DMA API is
> > also considered. Although it simpler to use, it is not preferred because
> > dma_alloc_*() APIs require a valid bus device as argument, which would
> > need converting the attestation driver into a platform device driver.
> > This is unnecessary, and since the attestation driver does not do real
> > DMA, there is no need to use real DMA APIs.
> 
> Let's actually try to walk through the requirements for the memory
> allocation here.
> 
>  1. The guest kernel needs to allocate some guest physical memory
>     for the attestation data buffer
>  2. The guest physical memory must be mapped by the guest so that
>     it can be read/written.
>  3. The guest mapping must be a "TDX Shared" mapping.  Since all
>     guest physical memory is "TDX Private" by default, something
>     must convert the memory from Private->Shared.
>  4. If there are alias mappings with "TDX Private" page table
>     permissions, those mappings must never be used while the page is
>     in its shared state.
>     4a. load_unaligned_zeropad() must be prevented from being used
> 	on the page immediately preceding a Private alias to a Shared
>  	page.
>  5. Actions that increasingly fracture the direct map must be avoided.
>     Attestation may happen many times and repeated allocations that
>     fracture the direct map have performance consequences.
>  6. A softer requirement: presuming that bounce buffers won't be used
>     for TDX devices *forever*, it would be nice to use a mechanism that
>     will continue to work on systems that don't have swiotlb on.
> 
> I think we've talked about three different solutions:
> 
> == vmalloc() ==
> 
> So, let's say we used a relatively plain vmalloc().  That's great for
> #1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
> properly on its PTEs.  But, it falls over for *either* #4 or #5.  If it
> leaves the direct map alone, it's exposed to load_unaligned_zeropad().
> If it unmaps the memory from the direct map, it runs afoul of #5.
> 
> == order-1 + vmap() ==
> 
> Let's now consider a vmalloc() variant: allocate a bunch of order-1
> pages and vmap() page[1], leaving page[0] as a guard page against
> load_unaligned_zeropad() on the direct map.  That works, but it's an
> annoying amount of code.
> 
> == swiotlb pages ==
> 
> Using the swiotlb bounce buffer pages is the other proposed option.
> They already have a working kernel mapping and have already been
> converted.  They are mitigated against load_unaligned_zeropad().  They
> do cause direct map fracturing, but only once since they're allocated
> statically.  They don't increasingly degrade things.  It's a one-time
> cost.  Their interaction with #6 is not great.
> 
> Did I miss anything?  Does that accurately capture where we are?

We can also reserve a dedicated CMA, but Kirill didn't like it.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2022-06-14 12:30   ` Wander Lairson Costa
  2022-07-21 16:08   ` Dave Hansen
@ 2022-07-22  0:27   ` Dave Hansen
  2022-07-22 19:05     ` Isaku Yamahata
  2 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-07-22  0:27 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> In TDX guest, the second stage in attestation process is to send the
> TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
> not support communication channels like vsock or TCP/IP, implement
> support to get TD Quote using hypercall. GetQuote hypercall can be used
> by the TD guest to request VMM facilitate the Quote generation via
> QE/QGS. More details about GetQuote hypercall can be found in TDX
> Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
> titled "TDG.VP.VMCALL<GetQuote>.

So, the quote portion of this is basically a bidirectional blob sender.
 It's to send a blob between guest userspace and host userspace.

Do we *REALLY* need specific driver functionality for this?  For
instance, is there no existing virtio device that can send blobs back
and forth?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22  0:27   ` Dave Hansen
@ 2022-07-22 19:05     ` Isaku Yamahata
  2022-07-22 19:13       ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Isaku Yamahata @ 2022-07-22 19:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Kai Huang,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, Jul 21, 2022 at 05:27:08PM -0700,
Dave Hansen <dave.hansen@intel.com> wrote:

> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> > In TDX guest, the second stage in attestation process is to send the
> > TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
> > not support communication channels like vsock or TCP/IP, implement
> > support to get TD Quote using hypercall. GetQuote hypercall can be used
> > by the TD guest to request VMM facilitate the Quote generation via
> > QE/QGS. More details about GetQuote hypercall can be found in TDX
> > Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
> > titled "TDG.VP.VMCALL<GetQuote>.
> 
> So, the quote portion of this is basically a bidirectional blob sender.
>  It's to send a blob between guest userspace to host userspace.
> 
> Do we *REALLY* need specific driver functionality for this?  For
> instance, is there no existing virtio device that can send blobs back
> and forth?

It's virtio-vsock.  If virtio-vsock is available, the communication works.
However, some users would like to disable virtio-vsock in their environment for
various reasons, or even disable virtio altogether, especially for confidential
computing use cases.  It's their choice.  It can't be assumed that virtio is
available.

The goal is a VMM-agnostic (but TDX-specific) interface for that.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22 19:05     ` Isaku Yamahata
@ 2022-07-22 19:13       ` Dave Hansen
  2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
  2022-07-25 11:05         ` Kai Huang
  0 siblings, 2 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-22 19:13 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Kai Huang,
	Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 7/22/22 12:05, Isaku Yamahata wrote:
>> So, the quote portion of this is basically a bidirectional blob sender.
>>  It's to send a blob between guest userspace to host userspace.
>>
>> Do we *REALLY* need specific driver functionality for this?  For
>> instance, is there no existing virtio device that can send blobs back
>> and forth?
> It's virtio-vsock.  If virtio-vsock is available, the communication works.
> However, some users would like to disable virtio-vsock on their environment for
> some reasons.  Even virtio at all.  Especially for confidential computing use
> case.  It's their choice.  It can't be assumed that virtio is available.
> 
> The goal is VMM-agnostic (but TDX-specific) interface for that.

You're basically saying that every confidential computing technology
should have its own host user <-> guest kernel <-> guest user ABI.
That's insanity.  If we do this, we need *one* interface that says "talk
to the hypervisor" that's common for all hypervisors and hardware
vendors, or at least more than *one*.

We don't need a way to talk to hypervisors for Intel systems and another
for AMD and yet another on whatever.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22 19:13       ` Dave Hansen
@ 2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
  2022-07-22 21:24           ` Dave Hansen
  2022-07-25 20:19           ` Nakajima, Jun
  2022-07-25 11:05         ` Kai Huang
  1 sibling, 2 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-22 21:18 UTC (permalink / raw)
  To: Dave Hansen, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun

+ Jun

On 7/22/22 12:13 PM, Dave Hansen wrote:
> On 7/22/22 12:05, Isaku Yamahata wrote:
>>> So, the quote portion of this is basically a bidirectional blob sender.
>>>  It's to send a blob between guest userspace to host userspace.
>>>
>>> Do we *REALLY* need specific driver functionality for this?  For
>>> instance, is there no existing virtio device that can send blobs back
>>> and forth?
>> It's virtio-vsock.  If virtio-vsock is available, the communication works.
>> However, some users would like to disable virtio-vsock on their environment for
>> some reasons.  Even virtio at all.  Especially for confidential computing use
>> case.  It's their choice.  It can't be assumed that virtio is available.
>>
>> The goal is VMM-agnostic (but TDX-specific) interface for that.
> 
> You're basically saying that every confidential computing technology
> should have its own host user <-> guest kernel <-> guest user ABI.
> That's insanity.  If we do this, we need *one* interface that says "talk
> to the hypervisor" that's common for all hypervisors and hardware
> vendors, or at least more than *one*.
> 
> We don't need a way to talk to hypervisors for Intel systems and another
> for AMD and yet another on whatever.

For cases where your platform does not want to support or enable the generic
interface (like vsock), isn't it better to have a fallback approach? I am not
saying we should have such an ABI for all cases. But attestation is a must-have
feature for the TDX guest, and we want to support it in all TD guest platforms.
I think the GHCI ABI is added to meet this requirement.

Jun/Isaku, if you are aware of the exact requirement for this hypercall, please
share it. Also let us know your comments on this topic.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
@ 2022-07-22 21:24           ` Dave Hansen
  2022-07-25 20:19           ` Nakajima, Jun
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-22 21:24 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Isaku Yamahata
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel, Nakajima, Jun

On 7/22/22 14:18, Sathyanarayanan Kuppuswamy wrote:
> For cases where your platform does not want to support or enable the generic
> interface (like vsock), isn't it better to have a fallback approach? I am not
> saying we should have such an ABI for all cases. But attestation is a must-have
> feature for the TDX guest, and we want to support it in all TD guest platforms.
> I think the GHCI ABI is added to meet this requirement.

This logic is basically: it's in the spec so it must be useful.  I don't
buy that, sorry.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22 19:13       ` Dave Hansen
  2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
@ 2022-07-25 11:05         ` Kai Huang
  1 sibling, 0 replies; 77+ messages in thread
From: Kai Huang @ 2022-07-25 11:05 UTC (permalink / raw)
  To: Dave Hansen, Isaku Yamahata
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On Fri, 2022-07-22 at 12:13 -0700, Dave Hansen wrote:
> On 7/22/22 12:05, Isaku Yamahata wrote:
> > > So, the quote portion of this is basically a bidirectional blob sender.
> > >  It's to send a blob between guest userspace to host userspace.
> > > 
> > > Do we *REALLY* need specific driver functionality for this?  For
> > > instance, is there no existing virtio device that can send blobs back
> > > and forth?
> > It's virtio-vsock.  If virtio-vsock is available, the communication works.
> > However, some users would like to disable virtio-vsock on their environment for
> > some reasons.  Even virtio at all.  Especially for confidential computing use
> > case.  It's their choice.  It can't be assumed that virtio is available.
> > 
> > The goal is VMM-agnostic (but TDX-specific) interface for that.
> 
> You're basically saying that every confidential computing technology
> should have its own host user <-> guest kernel <-> guest user ABI.
> That's insanity.  If we do this, we need *one* interface that says "talk
> to the hypervisor" that's common for all hypervisors and hardware
> vendors, or at least more than *one*.
> 
> We don't need a way to talk to hypervisors for Intel systems and another
> for AMD and yet another on whatever.

Maybe the GetQuote support can be an "Intel driver" with a dedicated Kconfig
option?  Not all customers want it anyway, so having a Kconfig option can also
reduce code size/attack window.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
  2022-07-22 21:24           ` Dave Hansen
@ 2022-07-25 20:19           ` Nakajima, Jun
  2022-07-25 20:23             ` Dave Hansen
  1 sibling, 1 reply; 77+ messages in thread
From: Nakajima, Jun @ 2022-07-25 20:19 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Hansen, Dave, Isaku Yamahata, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Luck, Tony, Andi Kleen, Huang, Kai,
	Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, linux-kernel

> On Jul 22, 2022, at 2:18 PM, Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> 
> + Jun
> 
> On 7/22/22 12:13 PM, Dave Hansen wrote:
>> On 7/22/22 12:05, Isaku Yamahata wrote:
>>>> So, the quote portion of this is basically a bidirectional blob sender.
>>>> It's to send a blob between guest userspace to host userspace.
>>>> 
>>>> Do we *REALLY* need specific driver functionality for this?  For
>>>> instance, is there no existing virtio device that can send blobs back
>>>> and forth?
>>> It's virtio-vsock.  If virtio-vsock is available, the communication works.
>>> However, some users would like to disable virtio-vsock on their environment for
>>> some reasons.  Even virtio at all.  Especially for confidential computing use
>>> case.  It's their choice.  It can't be assumed that virtio is available.
>>> 
>>> The goal is VMM-agnostic (but TDX-specific) interface for that.
>> 
>> You're basically saying that every confidential computing technology
>> should have its own host user <-> guest kernel <-> guest user ABI.
>> That's insanity.  If we do this, we need *one* interface that says "talk
>> to the hypervisor" that's common for all hypervisors and hardware
>> vendors, or at least more than *one*.
>> 
>> We don't need a way to talk to hypervisors for Intel systems and another
>> for AMD and yet another on whatever.
> 
> For cases where your platform does not want to support or enable the generic
> interface (like vsock), isn't it better to have a fallback approach? I am not
> saying we should have such an ABI for all cases. But attestation is a must-have
> feature for the TDX guest, and we want to support it in all TD guest platforms.
> I think the GHCI ABI is added to meet this requirement.
> 
> Jun/Isaku, if you are aware of the exact requirement for this hypercall, please
> share it. Also let us know your comments on this topic.
> 

Yes, a quote is a blob, but it has some special properties because of its nature (i.e. it is the essential data for verification).
1. It’s small (e.g. 4KB or something like that).
2. It’s one-time. It shouldn't change even if you repeat the request (GetQuote).
3. It needs to be available in minimal/early runtime environments, including pre-boot, e.g. guest BIOS, no user-space yet.

In my view, getting a quote using virtio-vsock is overkill both for the host and the guest. The host may not want the guests to talk directly to the host because of security concerns. 

This particular patch allows the guest user-space to get a quote for an attestation service, and it can be used for the first attestation of the guest, i.e. when the guest kernel has not been verified yet. It’s useful when getting the key for the guest confidential data, such as user data volume, before mounting the volume. Can we use virtio-vsock there? Yes, but, again, overkill and limited availability.

Since we don’t want to allow the user-space to use a hypercall (to get a quote), I think some sort of driver is needed there. So, I think this driver is useful at least as a fallback when virtio-vsock is not available.


--- 
Jun



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-25 20:19           ` Nakajima, Jun
@ 2022-07-25 20:23             ` Dave Hansen
  2022-07-25 21:56               ` Nakajima, Jun
  2022-08-09  6:20               ` Guorui Yu
  0 siblings, 2 replies; 77+ messages in thread
From: Dave Hansen @ 2022-07-25 20:23 UTC (permalink / raw)
  To: Nakajima, Jun, Sathyanarayanan Kuppuswamy
  Cc: Isaku Yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov, Luck,
	Tony, Andi Kleen, Huang, Kai, Wander Lairson Costa,
	marcelo.cerri, tim.gardner, khalid.elmously, Cox, Philip,
	linux-kernel

On 7/25/22 13:19, Nakajima, Jun wrote:
> 3. Need to be available in minimal/early runtime environments,
> including pre-boot, e.g. guest BIOS, no user-space yet.

Jun, are we talking about the same thing here?  This patch is for a
guest userspace -> guest kernel ABI.  This facility is *FOR* userspace.
 It can't possibly be used before userspace is running.

I'm horribly confused.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-25 20:23             ` Dave Hansen
@ 2022-07-25 21:56               ` Nakajima, Jun
  2022-07-25 22:06                 ` Sathyanarayanan Kuppuswamy
  2022-08-09  6:20               ` Guorui Yu
  1 sibling, 1 reply; 77+ messages in thread
From: Nakajima, Jun @ 2022-07-25 21:56 UTC (permalink / raw)
  To: Hansen, Dave
  Cc: Sathyanarayanan Kuppuswamy, Isaku Yamahata, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Luck, Tony, Andi Kleen, Huang, Kai,
	Wander Lairson Costa, marcelo.cerri, tim.gardner,
	khalid.elmously, Cox, Philip, linux-kernel


> On Jul 25, 2022, at 1:23 PM, Hansen, Dave <dave.hansen@intel.com> wrote:
> 
> On 7/25/22 13:19, Nakajima, Jun wrote:
>> 3. Need to be available in minimal/early runtime environments,
>> including pre-boot, e.g. guest BIOS, no user-space yet.
> 
> Jun, are we talking about the same thing here?  This patch is for a
> guest userspace -> guest kernel ABI.  This facility is *FOR* userspace.
> It can't possibly be used before userspace is running.
> 
> I'm horribly confused.

I was responding to one of Sathya’s questions, in particular why we have GetQuote in the GHCI.
The hypervisor needs to implement it anyway, because it doesn’t matter (and the hypervisor doesn’t know) whether the TD guest is running in the BIOS, the kernel, or userspace. Of course, the facility in this patch is for userspace, but we don’t want to suggest implementing two different GetQuote code paths for guests depending on the guest state, e.g. in the OS (kernel or userspace) or in the guest BIOS.

--- 
Jun






* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-25 21:56               ` Nakajima, Jun
@ 2022-07-25 22:06                 ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 77+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-07-25 22:06 UTC (permalink / raw)
  To: Nakajima, Jun, Hansen, Dave
  Cc: Isaku Yamahata, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov, Luck,
	Tony, Andi Kleen, Huang, Kai, Wander Lairson Costa,
	marcelo.cerri, tim.gardner, khalid.elmously, Cox, Philip,
	linux-kernel

Hi Jun,

On 7/25/22 2:56 PM, Nakajima, Jun wrote:
> 
>> On Jul 25, 2022, at 1:23 PM, Hansen, Dave <dave.hansen@intel.com> wrote:
>>
>> On 7/25/22 13:19, Nakajima, Jun wrote:
>>> 3. Need to be available in minimal/early runtime environments,
>>> including pre-boot, e.g. guest BIOS, no user-space yet.
>>
>> Jun, are we talking about the same thing here?  This patch is for a
>> guest userspace -> guest kernel ABI.  This facility is *FOR* userspace.
>> It can't possibly be used before userspace is running.
>>
>> I'm horribly confused.
> 
> I was responding to one of Sathya’s questions, in particular why we have GetQuote in the GHCI.
> The hypervisor needs to implement it anyway, because it doesn’t matter (and the hypervisor doesn’t know) whether the TD guest is running in the BIOS, the kernel, or userspace. Of course, the facility in this patch is for userspace, but we don’t want to suggest implementing two different GetQuote code paths for guests depending on the guest state, e.g. in the OS (kernel or userspace) or in the guest BIOS.

Ok. Since both the host and the QE need to support the GetQuote hypercall
to handle attestation requests from the BIOS, the QE/host may want to use
the same communication model for requests from guest user space
as well.

> 
> --- 
> Jun
> 
> 
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-07-25 20:23             ` Dave Hansen
  2022-07-25 21:56               ` Nakajima, Jun
@ 2022-08-09  6:20               ` Guorui Yu
  2022-11-21  2:04                 ` Guorui Yu
  1 sibling, 1 reply; 77+ messages in thread
From: Guorui Yu @ 2022-08-09  6:20 UTC (permalink / raw)
  To: dave.hansen
  Cc: ak, bp, dave.hansen, hpa, isaku.yamahata, jun.nakajima,
	kai.huang, khalid.elmously, kirill.shutemov, linux-kernel,
	marcelo.cerri, mingo, philip.cox, sathyanarayanan.kuppuswamy,
	tglx, tim.gardner, tony.luck, wander, x86

In my opinion, virtio-vsock is a wonderful way to achieve
communication between the host and the guest in traditional scenarios.

But from a security perspective, virtio-vsock itself may involve too many
components, which may lead to a considerable attack surface. In order
for virtio to work properly, the guest must maintain several vrings
placed in shared memory, which in turn depends on swiotlb in the current
codebase. As this flow is NOT designed for confidential scenarios, there
is ongoing and existing work on virtio hardening [1] [2] [3].

 > Do we *REALLY* need specific driver functionality for this?  For
 > instance, is there no existing virtio device that can send blobs back
 > and forth?

And we may indeed have use cases where virtio is not used at all, such
as when we consider Intel-TDX/AMD-SNP protected VMs as an SGX enclave
replacement. In such use cases, end users may expect the GetQuote
interface to have fewer dependencies. It doesn't seem natural that the
GetQuote interface cannot be used because virtio-vsock is disabled (via
kernel compile options or the upcoming device-filter), since GetQuote
doesn't inherently depend on it.

Thanks,
Guorui

[1] 
https://lore.kernel.org/lkml/20210603004133.4079390-1-ak@linux.intel.com/
[2] https://lore.kernel.org/lkml/20211012065227.9953-1-jasowang@redhat.com/
[3] 
https://lore.kernel.org/linux-iommu/d2ae0b1d-332b-42a1-87bf-7da2b749cac2@sect.tu-berlin.de/


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-08-09  6:20               ` Guorui Yu
@ 2022-11-21  2:04                 ` Guorui Yu
  2022-11-21  2:26                   ` Dave Hansen
  0 siblings, 1 reply; 77+ messages in thread
From: Guorui Yu @ 2022-11-21  2:04 UTC (permalink / raw)
  To: dave.hansen
  Cc: ak, bp, dave.hansen, hpa, isaku.yamahata, jun.nakajima,
	kai.huang, khalid.elmously, kirill.shutemov, linux-kernel,
	marcelo.cerri, mingo, philip.cox, sathyanarayanan.kuppuswamy,
	tglx, tim.gardner, tony.luck, wander, x86, Du, Fan

Hi Dave,

Kindly ping for any comments here?

Sincerely,
Guorui

On 2022/8/9 14:20, Guorui Yu wrote:
> In my opinion, virtio-vsock is a wonderful way to achieve
> communication between the host and the guest in traditional scenarios.
> 
> But from a security perspective, virtio-vsock itself may involve too many
> components, which may lead to a considerable attack surface. In order
> for virtio to work properly, the guest must maintain several vrings
> placed in shared memory, which in turn depends on swiotlb in the current
> codebase. As this flow is NOT designed for confidential scenarios, there
> is ongoing and existing work on virtio hardening [1] [2] [3].
> 
>  > Do we *REALLY* need specific driver functionality for this?  For
>  > instance, is there no existing virtio device that can send blobs back
>  > and forth?
> 
> And we may indeed have use cases where virtio is not used at all, such
> as when we consider Intel-TDX/AMD-SNP protected VMs as an SGX enclave
> replacement. In such use cases, end users may expect the GetQuote
> interface to have fewer dependencies. It doesn't seem natural that the
> GetQuote interface cannot be used because virtio-vsock is disabled (via
> kernel compile options or the upcoming device-filter), since GetQuote
> doesn't inherently depend on it.
> 
> Thanks,
> Guorui
> 
> [1] 
> https://lore.kernel.org/lkml/20210603004133.4079390-1-ak@linux.intel.com/
> [2] https://lore.kernel.org/lkml/20211012065227.9953-1-jasowang@redhat.com/
> [3] 
> https://lore.kernel.org/linux-iommu/d2ae0b1d-332b-42a1-87bf-7da2b749cac2@sect.tu-berlin.de/


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-11-21  2:04                 ` Guorui Yu
@ 2022-11-21  2:26                   ` Dave Hansen
  2023-01-07  0:58                     ` Erdem Aktas
  0 siblings, 1 reply; 77+ messages in thread
From: Dave Hansen @ 2022-11-21  2:26 UTC (permalink / raw)
  To: Guorui Yu
  Cc: ak, bp, dave.hansen, hpa, isaku.yamahata, jun.nakajima,
	kai.huang, khalid.elmously, kirill.shutemov, linux-kernel,
	marcelo.cerri, mingo, philip.cox, sathyanarayanan.kuppuswamy,
	tglx, tim.gardner, tony.luck, wander, x86, Du, Fan

On 11/20/22 18:04, Guorui Yu wrote:
> Kindly ping for any comments here?

No comments from me!

Seriously, this is from June.  I honestly have no idea what this patch
is for or why you're asking about it.  If it were important, I would
have expected a new post with a new cover letter telling me some time in
the past 5-6 months.

Oh, and btw, I haven't seen you reviewing any other x86 submissions
during the past 5-6 months either.  I tend to be more likely to bump
things up near the front of the queue for folks that are helping to keep
the queue short and manageable.


* Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
  2022-11-21  2:26                   ` Dave Hansen
@ 2023-01-07  0:58                     ` Erdem Aktas
  0 siblings, 0 replies; 77+ messages in thread
From: Erdem Aktas @ 2023-01-07  0:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Guorui Yu, ak, bp, dave.hansen, hpa, isaku.yamahata,
	jun.nakajima, kai.huang, khalid.elmously, kirill.shutemov,
	linux-kernel, marcelo.cerri, mingo, philip.cox,
	sathyanarayanan.kuppuswamy, tglx, tim.gardner, tony.luck, wander,
	x86, Du, Fan, Chong Cai, Sean Christopherson

I apologize for bumping an old thread and joining the discussion
this late.

I am bumping this thread because, as Chong mentions below, it is
essential for us to have the TDG.VP.VMCALL<GetQuote> interface, since we
do not enable or support virtio in many use cases.
I am coming from the "Add TDX Guest Attestation support" patch set [1],
which does not include Quote support, as per Dave's review suggestion,
and I am looking for what our options are here.

@Hansen, Dave: as far as I understand, your concerns were:

>>Do we *REALLY* need specific driver functionality for this? For
>>instance, is there no existing virtio device that can send blobs back and forth?

As Chong and I said, we need an alternative to vsock if we want to
provide TDX-based products to our customers. Otherwise our ability to
adopt this technology will be severely limited.

>> You're basically saying that every confidential computing technology
>> should have its own host user <-> guest kernel <-> guest user ABI.
>> That's insanity.  If we do this, we need *one* interface that says "talk
>> to the hypervisor" that's common for all hypervisors and hardware
>> vendors, or at least more than *one*.

This is already happening, right? AMD has the GHCB ABI while Intel has
the GHCI. There is also already a guest driver and an IOCTL interface for
TDREPORT, so would it really be that insane to have another IOCTL
interface for GetQuote?

I would like to get your opinion on how we can make progress on this.

-Erdem

[1] https://lore.kernel.org/lkml/20221116223820.819090-1-sathyanarayanan.kuppuswamy@linux.intel.com/t/#m341e6475982318d74be160c09a1856ad87e184cf

> From: Chong Cai <chongc@google.com>
> Date: Fri, Jan 6, 2023 at 2:02 PM
> Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
> To: Nakajima, Jun <jun.nakajima@intel.com>
> Cc: <ak@linux.intel.com>, <bp@alien8.de>, <dave.hansen@intel.com>, <dave.hansen@linux.intel.com>, <hpa@zytor.com>, <isaku.yamahata@gmail.com>, <kai.huang@intel.com>, <khalid.elmously@canonical.com>, <kirill.shutemov@linux.intel.com>, <linux-kernel@vger.kernel.org>, <marcelo.cerri@canonical.com>, <mingo@redhat.com>, <philip.cox@canonical.com>, <sathyanarayanan.kuppuswamy@linux.intel.com>, <tglx@linutronix.de>, <tim.gardner@canonical.com>, <tony.luck@intel.com>, <wander@redhat.com>, <x86@kernel.org>
>
> Hi everyone,
>
> Sorry for being late to the discussion.
>
> We just want to add that we have use cases where virtio/vsock is not supported at all; therefore, GetQuote TDVMCALL support is required for us to adopt this technology.
>
> Is there anything we can help with here?
>


On Sun, Nov 20, 2022 at 6:26 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/20/22 18:04, Guorui Yu wrote:
> > Kindly ping for any comments here?
>
> No comments from me!
>
> Seriously, this is from June.  I honestly have no idea what this patch
> is for or why you're asking about it.  If it were important, I would
> have expected a new post with a new cover letter telling me some time in
> the past 5-6 months.
>
> Oh, and btw, I haven't seen you reviewing any other x86 submissions
> during the past 5-6 months either.  I tend to be more likely to bump
> things up near the front of the queue for folks that are helping to keep
> the queue short and manageable.


end of thread, other threads:[~2023-01-07  0:59 UTC | newest]

Thread overview: 77+ messages
2022-06-09  2:52 [PATCH v8 0/5] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
2022-06-09  2:52 ` [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
2022-06-24 16:51   ` Dave Hansen
2022-06-27 14:50     ` Sathyanarayanan Kuppuswamy
2022-06-27 17:24       ` Dave Hansen
2022-06-30 23:50         ` Sathyanarayanan Kuppuswamy
2022-07-05 12:07           ` Kai Huang
2022-07-05 18:45             ` Sathyanarayanan Kuppuswamy
2022-07-05 18:52               ` Dave Hansen
2022-07-05 21:21                 ` Sathyanarayanan Kuppuswamy
2022-07-05 22:31                   ` Kai Huang
2022-07-06 22:27                     ` Sathyanarayanan Kuppuswamy
2022-07-06 22:59                       ` Kai Huang
2022-07-18 22:52                       ` Sathyanarayanan Kuppuswamy
2022-06-09  2:52 ` [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
2022-06-20 12:33   ` Kai Huang
2022-06-20 15:44     ` Sathyanarayanan Kuppuswamy
2022-06-23  9:46       ` Kai Huang
2022-06-23 10:24       ` Kai Huang
2022-06-24 22:23       ` Sathyanarayanan Kuppuswamy
2022-06-24 23:41       ` Nakajima, Jun
2022-06-25  3:35         ` Yao, Jiewen
2022-06-27 11:21           ` Kai Huang
2022-06-27 14:56             ` Sathyanarayanan Kuppuswamy
2022-07-14  0:46             ` Sathyanarayanan Kuppuswamy
2022-07-14 10:42               ` Kai Huang
2022-07-14 20:55                 ` Sathyanarayanan Kuppuswamy
2022-07-14 23:58                   ` Kai Huang
2022-06-09  2:52 ` [PATCH v8 3/5] x86/mm: Make tdx_enc_status_changed() vmalloc address compatible Kuppuswamy Sathyanarayanan
2022-06-09  2:52 ` [PATCH v8 4/5] x86/mm: Add noalias variants of set_memory_*crypted() functions Kuppuswamy Sathyanarayanan
2022-06-24 13:19   ` Dave Hansen
2022-06-27 15:12     ` Kirill A. Shutemov
2022-06-27 18:24       ` Dave Hansen
2022-06-28  1:15         ` Kai Huang
2022-07-05 15:29           ` Kirill A. Shutemov
2022-07-18 14:22             ` Sathyanarayanan Kuppuswamy
2022-07-19 16:13               ` Kirill A. Shutemov
2022-07-19 17:10                 ` Sathyanarayanan Kuppuswamy
2022-07-19 21:55                   ` Kirill A. Shutemov
2022-07-20 14:56                     ` Sathyanarayanan Kuppuswamy
2022-07-20 16:17                       ` Kirill A. Shutemov
2022-07-20 16:58                         ` Sathyanarayanan Kuppuswamy
2022-06-09  2:52 ` [PATCH v8 5/5] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
2022-06-14 12:30   ` Wander Lairson Costa
2022-06-14 12:58     ` Sathyanarayanan Kuppuswamy
2022-07-21 16:08   ` Dave Hansen
2022-07-21 16:42     ` Sathyanarayanan Kuppuswamy
2022-07-21 16:49       ` Dave Hansen
2022-07-21 16:54         ` Sathyanarayanan Kuppuswamy
2022-07-21 17:02           ` Dave Hansen
2022-07-21 17:16             ` Sathyanarayanan Kuppuswamy
2022-07-21 17:19               ` Dave Hansen
2022-07-21 18:31                 ` Sathyanarayanan Kuppuswamy
2022-07-21 18:42                 ` Isaku Yamahata
2022-07-21 18:52                   ` Dave Hansen
2022-07-21 18:57                     ` Sathyanarayanan Kuppuswamy
2022-07-21 19:23                       ` Dave Hansen
2022-07-21 22:08                         ` Sathyanarayanan Kuppuswamy
2022-07-21 23:16                         ` Kai Huang
2022-07-21 23:32     ` Kai Huang
2022-07-22  0:27   ` Dave Hansen
2022-07-22 19:05     ` Isaku Yamahata
2022-07-22 19:13       ` Dave Hansen
2022-07-22 21:18         ` Sathyanarayanan Kuppuswamy
2022-07-22 21:24           ` Dave Hansen
2022-07-25 20:19           ` Nakajima, Jun
2022-07-25 20:23             ` Dave Hansen
2022-07-25 21:56               ` Nakajima, Jun
2022-07-25 22:06                 ` Sathyanarayanan Kuppuswamy
2022-08-09  6:20               ` Guorui Yu
2022-11-21  2:04                 ` Guorui Yu
2022-11-21  2:26                   ` Dave Hansen
2023-01-07  0:58                     ` Erdem Aktas
2022-07-25 11:05         ` Kai Huang
2022-06-24 18:24 ` [PATCH v8 0/5] Add TDX Guest Attestation support Dave Hansen
2022-06-27 14:51   ` Sathyanarayanan Kuppuswamy
2022-06-27 18:51     ` Dave Hansen
