linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/3] Add TDX Guest Attestation support
@ 2022-05-01 18:34 Kuppuswamy Sathyanarayanan
  2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-05-01 18:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. VM guest with TDX support is called
as TD Guest.

In TD Guest, the attestation process is used to verify the 
trustworthiness of TD guest to the 3rd party servers. Such attestation
process is required by 3rd party servers before sending sensitive
information to TD guests. One usage example is to get encryption keys
from the key server for mounting the encrypted rootfs or secondary drive.
    
Following patches add the attestation support to TDX guest which
includes attestation user interface driver and related hypercall support.

Any distribution enabling TDX is also expected to need attestation. So
enable it by default with TDX guest support. The compiled size is
quite small (500 bytes).

Dependencies:
--------------

This feature has dependency on TDX guest core patch set series.

https://lore.kernel.org/all/20220218161718.67148-1-kirill.shutemov@linux.intel.com/T/

Changes since v4:
 * Removed platform driver model in attestation driver and used
   miscdevice and initcall approach.
 * Since dma_alloc*() APIs require a valid device reference,
   replaced it with __get_free_pages() and set_memory_decrypted()
   for quote memory allocation.
 * Removed tdx_mcall_tdreport() and moved TDG.MR.REPORT TDCALL code
   to tdx_get_report().
 * Used kmalloc() for TDREPORT memory allocation instead of
   get_zeroed_page().
 * Returned -EINVAL in default case of tdx_attest_ioctl().
 * Added struct tdx_report_req to explicitly mention the
   TDX_CMD_GET_REPORT IOCTL argument.
 * Removed tdx_get_quote_hypercall() and moved hypercall code to
   attestation driver itself.
 * Removed GetQuote timeout support (since it is not defined in
   spec)
 * Added support to check for spurious callback interrupt in GetQuote
   request.
 * Fixed commit log and comments as per review suggestions.
   

Changes since v3:
 * Moved the attestation driver from platform/x86 to arch/x86/coco/tdx/ and
   renamed intel_tdx_attest.c to attest.c.
 * Dropped CONFIG_INTEL_TDX_ATTESTATION and added support to compile
   attestation changes with CONFIG_INTEL_TDX_GUEST option.
 * Merged patch titled "x86/tdx: Add tdx_mcall_tdreport() API support" and
   "platform/x86: intel_tdx_attest: Add TDX Guest attestation interface" into
   a single patch.
 * Moved GetQuote IOCTL support changes from patch titled "platform/x86:
   intel_tdx_attest: Add TDX Guest attestation interface driver" to a
   separate patch.
 * Removed 8K size restriction when requesting quote, and added support
   to let userspace decide the quote size.
 * Added support to allow attestation agent configure quote generation
   timeout value.
 * Fixed commit log and comments as per review comments.

Changes since v2:
 * As per Han's suggestion, modified the attestation driver to use
   platform device driver model.
 * Modified tdx_hcall_get_quote() and tdx_mcall_tdreport() APIs to
   return TDCALL error code instead of generic error info (like -EIO).
 * Removed attestation test app patch from this series to simplify
   the patchset and review process. Test app patches will be submitted
   once attestation support patches are merged.
 * Since patches titled "x86/tdx: Add SetupEventNotifyInterrupt TDX
   hypercall support" and "x86/tdx: Add TDX Guest event notify
   interrupt vector support" are related, combining them into a
   single patch.

Changes since v1:
 * Moved test driver from "tools/tdx/attest/tdx-attest-test.c" to
   "tools/arch/x86/tdx/attest/tdx-attest-test.c" as per Hans review
   suggestion.
 * Minor commit log and comment fixes in patches titled
   "x86/tdx: Add tdx_mcall_tdreport() API support" and "x86/tdx:
   Add tdx_hcall_get_quote() API support"
 * Extended tdx_hcall_get_quote() API to accept GPA length as argument
   to accomodate latest TDQUOTE TDVMCALL related specification update.
 * Added support for tdx_setup_ev_notify_handler() and
   tdx_remove_ev_notify_handler() in patch titled "x86/tdx: Add TDX
   Guest event notify interrupt vector support"

Kuppuswamy Sathyanarayanan (3):
  x86/tdx: Add TDX Guest attestation interface driver
  x86/tdx: Add TDX Guest event notify interrupt support
  x86/tdx: Add Quote generation support

 arch/x86/coco/tdx/Makefile         |   2 +-
 arch/x86/coco/tdx/attest.c         | 273 +++++++++++++++++++++++++++++
 arch/x86/coco/tdx/tdx.c            |  73 ++++++++
 arch/x86/include/asm/hardirq.h     |   3 +
 arch/x86/include/asm/idtentry.h    |   4 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/tdx.h         |   4 +
 arch/x86/include/uapi/asm/tdx.h    |  76 ++++++++
 arch/x86/kernel/irq.c              |   7 +
 9 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/coco/tdx/attest.c
 create mode 100644 arch/x86/include/uapi/asm/tdx.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-01 18:34 [PATCH v5 0/3] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
@ 2022-05-01 18:34 ` Kuppuswamy Sathyanarayanan
  2022-05-02  2:31   ` Kai Huang
  2022-05-02 12:18   ` Wander Lairson Costa
  2022-05-01 18:34 ` [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
  2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2 siblings, 2 replies; 56+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-05-01 18:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

In TDX guest, attestation is used to verify the trustworthiness of a TD
to other entities before provisioning secrets to the TD.
   
One usage example is, when a TD guest uses encrypted drive and if the
decryption keys required to access the drive are stored in a secure 3rd
party keyserver, the key server can use attestation to verify TD's
trustworthiness and release the decryption keys to the TD.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
the TDX module which contains TD-specific information (such as TD
measurements), platform security version, and the MAC to protect the
integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
get the TDREPORT from the TDX module. A user-provided 64-Byte
REPORTDATA is used as input and included in the TDREPORT. Typically it
can be some nonce provided by attestation service so the TDREPORT can
be verified uniquely. More details about TDREPORT can be found in
Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".

After getting the TDREPORT, the second step of the attestation process
is to send the TDREPORT to Quoting Enclave (QE) or Quote Generation
Service (QGS) to generate the Quote. However the method of sending the
TDREPORT to QE/QGS, communication channel used and data format used is
specific to the implementation of QE/QGS.

A typical implementation is, TD userspace attestation software gets the
TDREPORT from TD kernel, sends it to QE/QGS, and QE/QGS returns the
Quote. TD attestation software can use any available communication
channel to talk to QE/QGS, such as using vsock and tcp/ip.

To support the case that those communication channels are not directly
available to the TD, TDX also defines TDVMCALLs to allow TD to ask VMM
to help with sending the TDREPORT and receiving the Quote. This support
is documented in the GHCI spec section titled "5.4 TD attestation".

Implement a basic attestation driver to allow TD userspace to get the
TDREPORT, which is sent to QE by the attestation software to generate
a Quote for remote verification.

Also note that explicit access permissions are not enforced in this
driver because the quote and measurements are not a secret. However
the access permissions of the device node can be used to set any
desired access policy. The udev default is usually root access
only.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/Makefile      |   2 +-
 arch/x86/coco/tdx/attest.c      | 138 ++++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/tdx.h |  36 +++++++++
 3 files changed, 175 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/coco/tdx/attest.c
 create mode 100644 arch/x86/include/uapi/asm/tdx.h

diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..d2db3e6770e5 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdcall.o attest.o
diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
new file mode 100644
index 000000000000..4543a0264ce7
--- /dev/null
+++ b/arch/x86/coco/tdx/attest.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * attest.c - TDX guest attestation interface driver.
+ *
+ * Implements user interface to trigger attestation process.
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *
+ */
+
+#define pr_fmt(fmt) "x86/tdx: attest: " fmt
+
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/set_memory.h>
+#include <linux/dma-mapping.h>
+#include <linux/jiffies.h>
+#include <linux/io.h>
+#include <asm/apic.h>
+#include <asm/tdx.h>
+#include <asm/irq_vectors.h>
+#include <uapi/asm/tdx.h>
+
+#define DRIVER_NAME "tdx-attest"
+
+/* TDREPORT module call leaf ID */
+#define TDX_GET_REPORT			4
+
+/* Upper 32 bits has the status code, so mask it */
+#define TDCALL_STATUS_CODE_MASK		0xffffffff00000000
+#define TDCALL_STATUS_CODE(a)		((a) & TDCALL_STATUS_CODE_MASK)
+
+static struct miscdevice miscdev;
+
+static long tdx_get_report(void __user *argp)
+{
+	void *reportdata = NULL, *tdreport = NULL;
+	long ret = 0;
+
+	/* Allocate buffer space for REPORTDATA */
+	reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
+	if (!reportdata)
+		return -ENOMEM;
+
+	/* Allocate buffer space for TDREPORT */
+	tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
+	if (!tdreport) {
+		ret = -ENOMEM;
+		goto failed;
+	}
+
+	/* Copy REPORTDATA from the user buffer */
+	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
+		ret = -EFAULT;
+		goto failed;
+	}
+
+	/*
+	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
+	 *
+	 * Pass the physical address of user generated REPORTDATA
+	 * and the physical address of the output buffer to the TDX
+	 * module to generate the TDREPORT. Generated data contains
+	 * measurements/configuration data of the TD guest. More info
+	 * about ABI can be found in TDX 1.0 Module specification, sec
+	 * titled "TDG.MR.REPORT".
+	 */
+	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
+				virt_to_phys(reportdata), 0, 0, NULL);
+	if (ret) {
+		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
+				TDCALL_STATUS_CODE(ret));
+		ret = -EIO;
+		goto failed;
+	}
+
+	/* Copy TDREPORT back to the user buffer */
+	if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
+		ret = -EFAULT;
+
+failed:
+	kfree(reportdata);
+	kfree(tdreport);
+	return ret;
+}
+
+static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
+			     unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	long ret = 0;
+
+	switch (cmd) {
+	case TDX_CMD_GET_REPORT:
+		ret = tdx_get_report(argp);
+		break;
+	default:
+		pr_debug("cmd %d not supported\n", cmd);
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static const struct file_operations tdx_attest_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= tdx_attest_ioctl,
+	.llseek		= no_llseek,
+};
+
+static int __init tdx_attestation_init(void)
+{
+	long ret;
+
+	/* Make sure we are in a valid TDX platform */
+	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+		return -EIO;
+
+	miscdev.name = DRIVER_NAME;
+	miscdev.minor = MISC_DYNAMIC_MINOR;
+	miscdev.fops = &tdx_attest_fops;
+
+	ret = misc_register(&miscdev);
+	if (ret) {
+		pr_err("misc device registration failed\n");
+		return ret;
+	}
+
+	pr_debug("module initialization success\n");
+
+	return 0;
+}
+device_initcall(tdx_attestation_init)
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
new file mode 100644
index 000000000000..9a7377723667
--- /dev/null
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_TDX_H
+#define _UAPI_ASM_X86_TDX_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORTDATA_LEN		64
+
+/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORT_LEN			1024
+
+/**
+ * struct tdx_report_req: Get TDREPORT from the TDX module.
+ *
+ * @reportdata : User-defined 64-Byte REPORTDATA to be included into
+ *		 TDREPORT. Typically it can be some nonce provided by
+ *		 attestation software so the generated TDREPORT can be
+ *		 uniquely verified.
+ * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
+ *		 TDX_REPORT_LEN.
+ *
+ * Used in TDX_CMD_GET_REPORT IOCTL request.
+ */
+struct tdx_report_req {
+	union {
+		__u8 reportdata[TDX_REPORTDATA_LEN];
+		__u8 tdreport[TDX_REPORT_LEN];
+	};
+};
+
+/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT) */
+#define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
+
+#endif /* _UAPI_ASM_X86_TDX_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support
  2022-05-01 18:34 [PATCH v5 0/3] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
  2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
@ 2022-05-01 18:34 ` Kuppuswamy Sathyanarayanan
  2022-05-02 12:44   ` Wander Lairson Costa
  2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2 siblings, 1 reply; 56+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-05-01 18:34 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

Host-guest event notification via configured interrupt vector is useful
in cases where a guest makes an asynchronous request and needs a
callback from the host to indicate the completion or to let the host
notify the guest about events like device removal. One usage example is,
callback requirement of GetQuote asynchronous hypercall.

In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
guest to specify which interrupt vector to use as an event-notify
vector to the VMM. Details about the SetupEventNotifyInterrupt
hypercall can be found in TDX Guest-Host Communication Interface
(GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
Add a tdx_hcall_set_notify_intr() helper function to implement the
SetupEventNotifyInterrupt hypercall.

Reserve 0xec IRQ vector address for TDX guest to receive the event
completion notification from VMM. Also add related IDT handler to
process the notification event.

Add support to track the notification event status via
/proc/interrupts.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/tdx.c            | 73 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/hardirq.h     |  3 ++
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/include/asm/irq_vectors.h |  7 ++-
 arch/x86/include/asm/tdx.h         |  4 ++
 arch/x86/kernel/irq.c              |  7 +++
 6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..b49211994864 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -11,6 +11,10 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/apic.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+#include <asm/desc.h>
 
 /* TDX module Call Leaf IDs */
 #define TDX_GET_INFO			1
@@ -19,6 +23,7 @@
 
 /* TDX hypercall Leaf IDs */
 #define TDVMCALL_MAP_GPA		0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -34,6 +39,28 @@
 #define VE_GET_PORT_NUM(e)	((e) >> 16)
 #define VE_IS_IO_STRING(e)	((e) & BIT(4))
 
+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+	tdx_event_notify_handler = handler;
+}
+EXPORT_SYMBOL_GPL(tdx_setup_ev_notify_handler);
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+	tdx_event_notify_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(tdx_remove_ev_notify_handler);
+
 /*
  * Wrapper for standard use of __tdx_hypercall with no output aside from
  * return code.
@@ -98,6 +125,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
 		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
 }
 
+/* TDX guest event notification handler */
+DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+
+	inc_irq_stat(irq_tdx_event_notify_count);
+
+	if (tdx_event_notify_handler)
+		tdx_event_notify_handler();
+
+	ack_APIC_irq();
+
+	set_irq_regs(old_regs);
+}
+
+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * return 0 on success or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+	/* Minimum vector value allowed is 32 */
+	if (vector < 32)
+		return -EINVAL;
+
+	/*
+	 * Register callback vector address with VMM. More details
+	 * about the ABI can be found in TDX Guest-Host-Communication
+	 * Interface (GHCI), sec titled
+	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+	 */
+	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+		return -EIO;
+
+	return 0;
+}
+
 static u64 get_cc_mask(void)
 {
 	struct tdx_module_output out;
@@ -688,5 +755,11 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
 	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
 
+	alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
+			asm_sysvec_tdx_event_notify);
+
+	if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
+		pr_warn("Setting event notification interrupt failed\n");
+
 	pr_info("Guest detected\n");
 }
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..582deff56210 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+	unsigned int irq_tdx_event_notify_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..655086dd940e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
 DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_asyncpf_interrupt);
 #endif
 
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR,	sysvec_tdx_event_notify);
+#endif
+
 #undef X86_TRAP_OTHER
 
 #endif
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..82ac0c0a34b1 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,12 @@
 #define HYPERV_STIMER0_VECTOR		0xed
 #endif
 
-#define LOCAL_TIMER_VECTOR		0xec
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+/* Vector on which TDX Guest event notification is delivered */
+#define TDX_GUEST_EVENT_NOTIFY_VECTOR	0xec
+#endif
+
+#define LOCAL_TIMER_VECTOR		0xeb
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);
 
 bool tdx_early_handle_ve(struct pt_regs *regs);
 
+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
 #else
 
 static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..a96ecd866723 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+	seq_printf(p, "%*s: ", prec, "TGN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->irq_tdx_event_notify_count);
+	seq_puts(p, "  TDX Guest event notification\n");
 #endif
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-01 18:34 [PATCH v5 0/3] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
  2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
  2022-05-01 18:34 ` [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
@ 2022-05-01 18:35 ` Kuppuswamy Sathyanarayanan
  2022-05-02  2:40   ` Kai Huang
                     ` (2 more replies)
  2 siblings, 3 replies; 56+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-05-01 18:35 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kuppuswamy Sathyanarayanan, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Kai Huang, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

In TDX guest, the second stage in attestation process is to send the
TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
not support communication channels like vsock or TCP/IP, implement
support to get TD Quote using TDVMCALL. GetQuote hypercall can be used
by the TD guest to request VMM facilitate the Quote generation via
QE/QGS. More details about GetQuote hypercall can be found in TDX
Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
titled "TDG.VP.VMCALL<GetQuote>.

Since GetQuote is an asynchronous request hypercall, it will not block
till the TD Quote is generated. So VMM uses callback interrupt vector
configured by SetupEventNotifyInterrupt hypercall to notify the guest
about Quote generation completion or failure. Upon receiving the
completion notification, status can be found in the TD Quote request
header.

Although GHCI specification does not restrict parallel GetQuote
requests, since Quote generation is not in performance critical path
and the frequency of attestation requests are expected to be low, only
support serialized quote generation requests. Parallel quote request
support can be added once demand arises.

Currently the buffer required to get the TD Quote data and its shared
mapping conversion (using set_memory_decrypted()) is handled within the
IOCTL handler. Although it will increase the TDX_CMD_GET_QUOTE IOCTL
response time,  it is negligible compared to the time required for the
quote generation completion. So IOCTL performance optimization is not
considered at this time.

Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 arch/x86/coco/tdx/attest.c      | 135 ++++++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/tdx.h |  40 ++++++++++
 2 files changed, 175 insertions(+)

diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
index 4543a0264ce7..6aba85297708 100644
--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <asm/apic.h>
 #include <asm/tdx.h>
+#include <asm/coco.h>
 #include <asm/irq_vectors.h>
 #include <uapi/asm/tdx.h>
 
@@ -29,12 +30,20 @@
 
 /* TDREPORT module call leaf ID */
 #define TDX_GET_REPORT			4
+/* GetQuote hypercall leaf ID */
+#define TDVMCALL_GET_QUOTE             0x10002
 
 /* Upper 32 bits has the status code, so mask it */
 #define TDCALL_STATUS_CODE_MASK		0xffffffff00000000
 #define TDCALL_STATUS_CODE(a)		((a) & TDCALL_STATUS_CODE_MASK)
 
 static struct miscdevice miscdev;
+/* Completion object to track GetQuote completion status */
+static DECLARE_COMPLETION(req_compl);
+/* Mutex to serialize GetQuote requests */
+static DEFINE_MUTEX(quote_lock);
+/* Pointer to track current quote request */
+static void *tdquote;
 
 static long tdx_get_report(void __user *argp)
 {
@@ -88,6 +97,126 @@ static long tdx_get_report(void __user *argp)
 	return ret;
 }
 
+/* tdx_get_quote_hypercall() - Request to get TD Quote using TDREPORT.
+ *
+ * @quote_buf	: Address of 4KB aligned GPA memory which contains
+ *		  TDREPORT. It is also used as output buffer to copy
+ *		  the GetQuote output upon successful execution.
+ * @quote_len	: Length of the GPA in bytes.
+ *
+ * Return hypercall status code.
+ */
+static long tdx_get_quote_hypercall(void *quote_buf, u64 quote_len)
+{
+	struct tdx_hypercall_args args = {0};
+
+	args.r10 = TDX_HYPERCALL_STANDARD;
+	args.r11 = TDVMCALL_GET_QUOTE;
+	args.r12 = cc_mkdec(virt_to_phys(quote_buf));
+	args.r13 = quote_len;
+
+	/*
+	 * Pass the physical address of TDREPORT to the VMM and
+	 * trigger the Quote generation. It is not a blocking
+	 * call, hence completion of this request will be notified to
+	 * the TD guest via a callback interrupt. More info about ABI
+	 * can be found in TDX Guest-Host-Communication Interface
+	 * (GHCI), sec titled "TDG.VP.VMCALL<GetQuote>".
+	 */
+	return __tdx_hypercall(&args, 0);
+}
+
+static long tdx_get_quote(void __user *argp)
+{
+	struct tdx_quote_req quote_req;
+	long ret = 0;
+	int order;
+
+	/* Hold lock to serialize GetQuote requests */
+	mutex_lock(&quote_lock);
+
+	reinit_completion(&req_compl);
+
+	/* Copy GetQuote request struct from user buffer */
+	if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req))) {
+		ret = -EFAULT;
+		goto quote_failed;
+	}
+
+	/* Make sure the length & timeout is valid */
+	if (!quote_req.len || !quote_req.timeout) {
+		ret = -EINVAL;
+		goto quote_failed;
+	}
+
+	/* Get order for Quote buffer page allocation */
+	order = get_order(quote_req.len);
+
+	/*
+	 * Allocate buffer to get TD Quote from the VMM.
+	 * Size needs to be 4KB aligned (which is already
+	 * met in page allocation).
+	 */
+	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+	if (!tdquote) {
+		ret = -ENOMEM;
+		goto quote_failed;
+	}
+
+	/*
+	 * Since this buffer will be shared with the VMM via GetQuote
+	 * hypercall, decrypt it.
+	 */
+	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
+	if (ret)
+		goto quote_failed;
+
+	/* Copy TDREPORT from user buffer to kernel Quote buffer */
+	if (copy_from_user(tdquote, (void __user *)quote_req.buf, quote_req.len)) {
+		ret = -EFAULT;
+		goto quote_failed;
+	}
+
+	/* Submit GetQuote Request */
+	ret = tdx_get_quote_hypercall(tdquote, (1ULL << order) * PAGE_SIZE);
+	if (ret) {
+		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
+		ret = -EIO;
+		goto quote_failed;
+	}
+
+	/* Wait for attestation completion */
+	ret = wait_for_completion_interruptible(&req_compl);
+	if (ret <= 0) {
+		ret = -EIO;
+		goto quote_failed;
+	}
+
+	/* Copy output data back to user buffer */
+	if (copy_to_user((void __user *)quote_req.buf, tdquote, quote_req.len))
+		ret = -EFAULT;
+
+quote_failed:
+	if (tdquote)
+		free_pages((unsigned long)tdquote, order);
+	tdquote = NULL;
+	mutex_unlock(&quote_lock);
+	return ret;
+}
+
+static void attestation_callback_handler(void)
+{
+	struct tdx_quote_hdr *quote_hdr;
+
+	quote_hdr = (struct tdx_quote_hdr *) tdquote;
+
+	/* Check for spurious callback IRQ case */
+	if (!tdquote || quote_hdr->status == GET_QUOTE_IN_FLIGHT)
+		return;
+
+	complete(&req_compl);
+}
+
 static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
 			     unsigned long arg)
 {
@@ -98,6 +227,9 @@ static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
 	case TDX_CMD_GET_REPORT:
 		ret = tdx_get_report(argp);
 		break;
+	case TDX_CMD_GET_QUOTE:
+		ret = tdx_get_quote(argp);
+		break;
 	default:
 		pr_debug("cmd %d not supported\n", cmd);
 		ret = -EINVAL;
@@ -121,6 +253,9 @@ static int __init tdx_attestation_init(void)
 	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
 		return -EIO;
 
+	/* Register attestation event notify handler */
+	tdx_setup_ev_notify_handler(attestation_callback_handler);
+
 	miscdev.name = DRIVER_NAME;
 	miscdev.minor = MISC_DYNAMIC_MINOR;
 	miscdev.fops = &tdx_attest_fops;
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
index 9a7377723667..3eb60253432c 100644
--- a/arch/x86/include/uapi/asm/tdx.h
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -33,4 +33,44 @@ struct tdx_report_req {
 /* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT) */
 #define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
 
+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf		: Pass user data that includes TDREPORT as input. Upon
+ *		  successful completion of IOCTL, output is copied
+ *		  back to the same buffer.
+ * @len		: Length of the buffer.
+ */
+struct tdx_quote_req {
+	__u64 buf;
+	__u64 len;
+	__u32 timeout;
+};
+
+/* Get TD Quote from QE/QGS using TDREPORT passed by the user */
+#define TDX_CMD_GET_QUOTE		_IOR('T', 0x02, struct tdx_quote_req)
+
+/* TD Quote status codes */
+#define GET_QUOTE_SUCCESS		0
+#define GET_QUOTE_IN_FLIGHT		0xffffffffffffffff
+#define GET_QUOTE_ERROR			0x8000000000000000
+#define GET_QUOTE_SERVICE_UNAVAILABLE	0x8000000000000001
+
+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+	/* Quote version, filled by TD */
+	__u64 version;
+	/* Status code of Quote request, filled by VMM */
+	__u64 status;
+	/* Length of TDREPORT, filled by TD */
+	__u32 in_len;
+	/* Length of Quote, filled by VMM */
+	__u32 out_len;
+	/* Actual Quote data */
+	__u64 data[0];
+};
+
 #endif /* _UAPI_ASM_X86_TDX_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
@ 2022-05-02  2:31   ` Kai Huang
  2022-05-02 15:52     ` Sathyanarayanan Kuppuswamy
  2022-05-02 12:18   ` Wander Lairson Costa
  1 sibling, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-02  2:31 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Sun, 2022-05-01 at 11:34 -0700, Kuppuswamy Sathyanarayanan wrote:
> In TDX guest, attestation is used to verify the trustworthiness of a TD
> to other entities before provisioning secrets to the TD.
>    
> One usage example is, when a TD guest uses encrypted drive and if the
> decryption keys required to access the drive are stored in a secure 3rd
> party keyserver, the key server can use attestation to verify TD's
> trustworthiness and release the decryption keys to the TD.
> 
> The attestation process consists of two steps: TDREPORT generation and
> Quote generation.
> 
> TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
> the TDX module which contains TD-specific information (such as TD
> measurements), platform security version, and the MAC to protect the
> integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
> get the TDREPORT from the TDX module. A user-provided 64-Byte
> REPORTDATA is used as input and included in the TDREPORT. Typically it
> can be some nonce provided by attestation service so the TDREPORT can
> be verified uniquely. More details about TDREPORT can be found in
> Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".
> 
> After getting the TDREPORT, the second step of the attestation process
> is to send the TDREPORT to Quoting Enclave (QE) or Quote Generation
> Service (QGS) to generate the Quote. However the method of sending the
> TDREPORT to QE/QGS, communication channel used and data format used is
> specific to the implementation of QE/QGS.
> 
> A typical implementation is, TD userspace attestation software gets the
> TDREPORT from TD kernel, sends it to QE/QGS, and QE/QGS returns the
> Quote. TD attestation software can use any available communication
> channel to talk to QE/QGS, such as using vsock and tcp/ip.
> 
> To support the case that those communication channels are not directly
> available to the TD, TDX also defines TDVMCALLs to allow TD to ask VMM
> to help with sending the TDREPORT and receiving the Quote. This support
> is documented in the GHCI spec section titled "5.4 TD attestation".

Sorry I didn't think thoroughly in the reply to v4.  I think it's still
necessary to mention TDREPORT can only be verified locally, because otherwise
Quote isn't conceptually needed.  And above 3 paragraphs are too verbose I
guess.  How about below:

"
TDREPORT can only be verified locally as the MAC key is bound to the platform. 
TDX attestation leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT
locally and convert it to a remote verifiable Quote to support remote
attestation of the TDREPORT. 

TD userspace attestation software firstly gets the TDREPORT from the TD kernel,
and then sends it to the QE to generate a TD Quote.  The QE sends the Quote back
after it is generated.  How is the data sent and received is QE implementation
and deployment specific.  TD userspace attestation software can use whatever
communication channel available (i.e. tcp/ip, vsock) to communicate with the QE
using whatever data format.  TDX also defines TDVMCALLs to allow TD to ask VMM
to facilitate sending/receiving data between TD attestation software and the QE.
This support is documented in GHCI 1.0 spec "5.4 TD attestation".

Implement a basic attestation driver to ...
"

> 
> Implement a basic attestation driver to allow TD userspace to get the
> TDREPORT, which is sent to QE by the attestation software to generate
> a Quote for remote verification.
> 
> Also note that explicit access permissions are not enforced in this
> driver because the quote and measurements are not a secret. However
> the access permissions of the device node can be used to set any
> desired access policy. The udev default is usually root access
> only.

The IOCTL vs /sysfs isn't discussed.

For instance, after rough thinking, why is the IOCTL better than below approach
using /sysfs?

echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
cat /sys/kernel/coco/tdx/attest/tdreport

Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver to call
TDCALL to get the TDREPORT, which is available at '/sys/.../tdreport'.

The benefit of using IOCTL I can think of now is it is perhaps more secure, as
with IOCTL the REPORTDATA and the TDREPORT is visible to the process which calls
the IOCTL, while using the /sysfs they are potentially visible to any process. 
Especially the REPORTDATA, i.e. it can come from attestation service after the
TD attestation agent sets up a secure connection with it.

Anyway, my 2cents above.

> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/coco/tdx/Makefile      |   2 +-
>  arch/x86/coco/tdx/attest.c      | 138 ++++++++++++++++++++++++++++++++
>  arch/x86/include/uapi/asm/tdx.h |  36 +++++++++
>  3 files changed, 175 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/coco/tdx/attest.c
>  create mode 100644 arch/x86/include/uapi/asm/tdx.h
> 
> diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
> index 46c55998557d..d2db3e6770e5 100644
> --- a/arch/x86/coco/tdx/Makefile
> +++ b/arch/x86/coco/tdx/Makefile
> @@ -1,3 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
> -obj-y += tdx.o tdcall.o
> +obj-y += tdx.o tdcall.o attest.o
> diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
> new file mode 100644
> index 000000000000..4543a0264ce7
> --- /dev/null
> +++ b/arch/x86/coco/tdx/attest.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * attest.c - TDX guest attestation interface driver.
> + *
> + * Implements user interface to trigger attestation process.
> + *
> + * Copyright (C) 2022 Intel Corporation
> + *
> + */
> +
> +#define pr_fmt(fmt) "x86/tdx: attest: " fmt
> +
> +#include <linux/module.h>

Still need this?

> +#include <linux/miscdevice.h>
> +#include <linux/uaccess.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/set_memory.h>
> +#include <linux/dma-mapping.h>

Still need above two?

> +#include <linux/jiffies.h>

Don't need this either.

> +#include <linux/io.h>
> +#include <asm/apic.h>
> +#include <asm/tdx.h>
> +#include <asm/irq_vectors.h>
> +#include <uapi/asm/tdx.h>

Please get rid of unnecessary headers.

> +
> +#define DRIVER_NAME "tdx-attest"
> +
> +/* TDREPORT module call leaf ID */
> +#define TDX_GET_REPORT			4
> +
> +/* Upper 32 bits has the status code, so mask it */
> +#define TDCALL_STATUS_CODE_MASK		0xffffffff00000000
> +#define TDCALL_STATUS_CODE(a)		((a) & TDCALL_STATUS_CODE_MASK)
> +
> +static struct miscdevice miscdev;
> +
> +static long tdx_get_report(void __user *argp)
> +{
> +	void *reportdata = NULL, *tdreport = NULL;
> +	long ret = 0;
> +
> +	/* Allocate buffer space for REPORTDATA */
> +	reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
> +	if (!reportdata)
> +		return -ENOMEM;
> +
> +	/* Allocate buffer space for TDREPORT */
> +	tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
> +	if (!tdreport) {
> +		ret = -ENOMEM;
> +		goto failed;

Perhaps 'failed' -> 'out'.  That code is for both error and non-error case.

> +	}
> +
> +	/* Copy REPORTDATA from the user buffer */
> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
> +		ret = -EFAULT;
> +		goto failed;
> +	}
> +
> +	/*
> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> +	 *
> +	 * Pass the physical address of user generated REPORTDATA
> +	 * and the physical address of the output buffer to the TDX
> +	 * module to generate the TDREPORT. Generated data contains
> +	 * measurements/configuration data of the TD guest. More info
> +	 * about ABI can be found in TDX 1.0 Module specification, sec
> +	 * titled "TDG.MR.REPORT".

I guess you can get rid of the entire second paragraph.  If the reference to the
spec is useful, then keep it but other sentences are not quite useful.  Perhaps:

	Get the TDREPORT using REPORTDATA as input.  Refer to 22.3.3 
	TDG.MR.REPORT leaf in the TDX Module 1.0 Specification for detail
	information.

> +	 */
> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> +				virt_to_phys(reportdata), 0, 0, NULL);
> +	if (ret) {
> +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
> +				TDCALL_STATUS_CODE(ret));

You can just print out the exact error code.  It's more informative and can help
to debug.

> +		ret = -EIO;
> +		goto failed;
> +	}
> +
> +	/* Copy TDREPORT back to the user buffer */
> +	if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
> +		ret = -EFAULT;
> +
> +failed:
> +	kfree(reportdata);
> +	kfree(tdreport);
> +	return ret;
> +}
> +
> +static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
> +			     unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	long ret = 0;

If you initialize ret to -EINVAL here, then ...
> +
> +	switch (cmd) {
> +	case TDX_CMD_GET_REPORT:
> +		ret = tdx_get_report(argp);
> +		break;
> +	default:
> +		pr_debug("cmd %d not supported\n", cmd);
> +		ret = -EINVAL;

You don't have to set it here.

> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct file_operations tdx_attest_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= tdx_attest_ioctl,
> +	.llseek		= no_llseek,
> +};
> +
> +static int __init tdx_attestation_init(void)
> +{
> +	long ret;
> +
> +	/* Make sure we are in a valid TDX platform */
> +	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> +		return -EIO;
> +
> +	miscdev.name = DRIVER_NAME;
> +	miscdev.minor = MISC_DYNAMIC_MINOR;
> +	miscdev.fops = &tdx_attest_fops;
> +
> +	ret = misc_register(&miscdev);
> +	if (ret) {
> +		pr_err("misc device registration failed\n");

pr_debug() is used when __tdx_module_call() fails, and in the default case in
tdx_attest_ioctl() too.

Shouldn't those error msg be printed using the same way?

> +		return ret;
> +	}
> +
> +	pr_debug("module initialization success\n");

I don't think it's a module anymore?

Also perhaps just pr_info()?

> +
> +	return 0;
> +}
> +device_initcall(tdx_attestation_init)
> diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
> new file mode 100644
> index 000000000000..9a7377723667
> --- /dev/null
> +++ b/arch/x86/include/uapi/asm/tdx.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_ASM_X86_TDX_H
> +#define _UAPI_ASM_X86_TDX_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
> +#define TDX_REPORTDATA_LEN		64
> +
> +/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
> +#define TDX_REPORT_LEN			1024

I prefer TDX_TDREPORT_LEN.

> +
> +/**
> + * struct tdx_report_req: Get TDREPORT from the TDX module.

Just get the TDREPORT is enough I guess.

> + *
> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
> + *		 TDREPORT. Typically it can be some nonce provided by
> + *		 attestation software so the generated TDREPORT can be
> + *		 uniquely verified.
> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
> + *		 TDX_REPORT_LEN.
> + *
> + * Used in TDX_CMD_GET_REPORT IOCTL request.
> + */
> +struct tdx_report_req {
> +	union {
> +		__u8 reportdata[TDX_REPORTDATA_LEN];
> +		__u8 tdreport[TDX_REPORT_LEN];
> +	};
> +};

I am not sure overriding the input is a good idea, but will leave to others.

> +
> +/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT) */

Just get the TDREPORT is enough I guess.

> +#define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
> +
> +#endif /* _UAPI_ASM_X86_TDX_H */


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
@ 2022-05-02  2:40   ` Kai Huang
  2022-05-03  1:27     ` Kirill A. Shutemov
  2022-05-02  5:01   ` Kai Huang
  2022-05-02 13:16   ` Wander Lairson Costa
  2 siblings, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-02  2:40 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel


> +
> +	/* Get order for Quote buffer page allocation */
> +	order = get_order(quote_req.len);
> +
> +	/*
> +	 * Allocate buffer to get TD Quote from the VMM.
> +	 * Size needs to be 4KB aligned (which is already
> +	 * met in page allocation).
> +	 */
> +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> +	if (!tdquote) {
> +		ret = -ENOMEM;
> +		goto quote_failed;
> +	}

You can use alloc_pages_exact().

> +
> +	/*
> +	 * Since this buffer will be shared with the VMM via GetQuote
> +	 * hypercall, decrypt it.
> +	 */
> +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> +	if (ret)
> +		goto quote_failed;


Again, Dave and Andi already commented you should use vmap() to avoid breaking
up the direct-mapping.  Please use vmap() instead.

https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/

Will review the rest later.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2022-05-02  2:40   ` Kai Huang
@ 2022-05-02  5:01   ` Kai Huang
  2022-05-02 16:02     ` Sathyanarayanan Kuppuswamy
  2022-05-02 13:16   ` Wander Lairson Costa
  2 siblings, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-02  5:01 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel


> +
> +static long tdx_get_quote(void __user *argp)
> +{
> +	struct tdx_quote_req quote_req;
> +	long ret = 0;
> +	int order;
> +
> +	/* Hold lock to serialize GetQuote requests */
> +	mutex_lock(&quote_lock);
> +
> +	reinit_completion(&req_compl);
> +
> +	/* Copy GetQuote request struct from user buffer */
> +	if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req))) {
> +		ret = -EFAULT;
> +		goto quote_failed;
> +	}
> +
> +	/* Make sure the length & timeout is valid */
> +	if (!quote_req.len || !quote_req.timeout) {
> +		ret = -EINVAL;
> +		goto quote_failed;
> +	}
> +
> +	/* Get order for Quote buffer page allocation */
> +	order = get_order(quote_req.len);
> +
> +	/*
> +	 * Allocate buffer to get TD Quote from the VMM.
> +	 * Size needs to be 4KB aligned (which is already
> +	 * met in page allocation).
> +	 */
> +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> +	if (!tdquote) {
> +		ret = -ENOMEM;
> +		goto quote_failed;
> +	}
> +
> +	/*
> +	 * Since this buffer will be shared with the VMM via GetQuote
> +	 * hypercall, decrypt it.
> +	 */
> +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> +	if (ret)
> +		goto quote_failed;
> +
> +	/* Copy TDREPORT from user buffer to kernel Quote buffer */
> +	if (copy_from_user(tdquote, (void __user *)quote_req.buf, quote_req.len)) {
> +		ret = -EFAULT;
> +		goto quote_failed;
> +	}
> +
> +	/* Submit GetQuote Request */
> +	ret = tdx_get_quote_hypercall(tdquote, (1ULL << order) * PAGE_SIZE);
> +	if (ret) {
> +		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
> +		ret = -EIO;
> +		goto quote_failed;
> +	}
> +
> +	/* Wait for attestation completion */
> +	ret = wait_for_completion_interruptible(&req_compl);
> +	if (ret <= 0) {
> +		ret = -EIO;
> +		goto quote_failed;
> +	}
> +
> +	/* Copy output data back to user buffer */
> +	if (copy_to_user((void __user *)quote_req.buf, tdquote, quote_req.len))
> +		ret = -EFAULT;
> +
> +quote_failed:
> +	if (tdquote)
> +		free_pages((unsigned long)tdquote, order);

The buffer is freed w/o being converted back to private.  How can you prevent
the buffer from being allocated by kernel and used as private pages again?

Also, the  buffer may be still used by VMM when timeout (IN_FLIGHT), how can
this even work?

> +	tdquote = NULL;
> +	mutex_unlock(&quote_lock);
> +	return ret;
> +}
> +
> +static void attestation_callback_handler(void)
> +{
> +	struct tdx_quote_hdr *quote_hdr;
> +
> +	quote_hdr = (struct tdx_quote_hdr *) tdquote;
> +
> +	/* Check for spurious callback IRQ case */
> +	if (!tdquote || quote_hdr->status == GET_QUOTE_IN_FLIGHT)
> +		return;

I don't get the logic.  Please explain.

> +
> +	complete(&req_compl);
> +}
> +


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
  2022-05-02  2:31   ` Kai Huang
@ 2022-05-02 12:18   ` Wander Lairson Costa
  2022-05-02 16:06     ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 56+ messages in thread
From: Wander Lairson Costa @ 2022-05-02 12:18 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Sun, May 01, 2022 at 11:34:58AM -0700, Kuppuswamy Sathyanarayanan wrote:

[snip]

> +
> +static struct miscdevice miscdev;
> +
> +static long tdx_get_report(void __user *argp)
> +{
> +	void *reportdata = NULL, *tdreport = NULL;
> +	long ret = 0;
> +
> +	/* Allocate buffer space for REPORTDATA */
> +	reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
> +	if (!reportdata)
> +		return -ENOMEM;
> +
> +	/* Allocate buffer space for TDREPORT */
> +	tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
> +	if (!tdreport) {
> +		ret = -ENOMEM;
> +		goto failed;
> +	}
> +
> +	/* Copy REPORTDATA from the user buffer */
> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
> +		ret = -EFAULT;
> +		goto failed;
> +	}
> +
> +	/*
> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> +	 *
> +	 * Pass the physical address of user generated REPORTDATA
> +	 * and the physical address of the output buffer to the TDX
> +	 * module to generate the TDREPORT. Generated data contains
> +	 * measurements/configuration data of the TD guest. More info
> +	 * about ABI can be found in TDX 1.0 Module specification, sec
> +	 * titled "TDG.MR.REPORT".
> +	 */
> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> +				virt_to_phys(reportdata), 0, 0, NULL);
> +	if (ret) {
> +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
> +				TDCALL_STATUS_CODE(ret));

Should we use pr_err instead?

[snip]


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support
  2022-05-01 18:34 ` [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
@ 2022-05-02 12:44   ` Wander Lairson Costa
  0 siblings, 0 replies; 56+ messages in thread
From: Wander Lairson Costa @ 2022-05-02 12:44 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Sun, May 01, 2022 at 11:34:59AM -0700, Kuppuswamy Sathyanarayanan wrote:
> Host-guest event notification via configured interrupt vector is useful
> in cases where a guest makes an asynchronous request and needs a
> callback from the host to indicate the completion or to let the host
> notify the guest about events like device removal. One usage example is,
> callback requirement of GetQuote asynchronous hypercall.
> 
> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> guest to specify which interrupt vector to use as an event-notify
> vector to the VMM. Details about the SetupEventNotifyInterrupt
> hypercall can be found in TDX Guest-Host Communication Interface
> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
> Add a tdx_hcall_set_notify_intr() helper function to implement the
> SetupEventNotifyInterrupt hypercall.
> 
> Reserve 0xec IRQ vector address for TDX guest to receive the event
> completion notification from VMM. Also add related IDT handler to
> process the notification event.
> 
> Add support to track the notification event status via
> /proc/interrupts.
> 
> Reviewed-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  arch/x86/coco/tdx/tdx.c            | 73 ++++++++++++++++++++++++++++++
>  arch/x86/include/asm/hardirq.h     |  3 ++
>  arch/x86/include/asm/idtentry.h    |  4 ++
>  arch/x86/include/asm/irq_vectors.h |  7 ++-
>  arch/x86/include/asm/tdx.h         |  4 ++
>  arch/x86/kernel/irq.c              |  7 +++
>  6 files changed, 97 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 03deb4d6920d..b49211994864 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -11,6 +11,10 @@
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
>  #include <asm/pgtable.h>
> +#include <asm/apic.h>
> +#include <asm/idtentry.h>
> +#include <asm/irq_regs.h>
> +#include <asm/desc.h>
>  
>  /* TDX module Call Leaf IDs */
>  #define TDX_GET_INFO			1
> @@ -19,6 +23,7 @@
>  
>  /* TDX hypercall Leaf IDs */
>  #define TDVMCALL_MAP_GPA		0x10001
> +#define TDVMCALL_SETUP_NOTIFY_INTR	0x10004
>  
>  /* MMIO direction */
>  #define EPT_READ	0
> @@ -34,6 +39,28 @@
>  #define VE_GET_PORT_NUM(e)	((e) >> 16)
>  #define VE_IS_IO_STRING(e)	((e) & BIT(4))
>  
> +/*
> + * Handler used to report notifications about
> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> + * used only by the attestation driver. So, race condition
> + * with read/write operation is not considered.
> + */
> +static void (*tdx_event_notify_handler)(void);
> +
> +/* Helper function to register tdx_event_notify_handler */
> +void tdx_setup_ev_notify_handler(void (*handler)(void))
> +{
> +	tdx_event_notify_handler = handler;
> +}
> +EXPORT_SYMBOL_GPL(tdx_setup_ev_notify_handler);
> +
> +/* Helper function to unregister tdx_event_notify_handler */
> +void tdx_remove_ev_notify_handler(void)
> +{
> +	tdx_event_notify_handler = NULL;
> +}
> +EXPORT_SYMBOL_GPL(tdx_remove_ev_notify_handler);
> +
>  /*
>   * Wrapper for standard use of __tdx_hypercall with no output aside from
>   * return code.
> @@ -98,6 +125,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>  }
>  
> +/* TDX guest event notification handler */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
> +{
> +	struct pt_regs *old_regs = set_irq_regs(regs);
> +
> +	inc_irq_stat(irq_tdx_event_notify_count);
> +
> +	if (tdx_event_notify_handler)
> +		tdx_event_notify_handler();
> +
> +	ack_APIC_irq();
> +
> +	set_irq_regs(old_regs);
> +}
> +
> +/*
> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
> + *
> + * @vector: Vector address to be used for notification.
> + *
> + * return 0 on success or failure error number.
> + */
> +static long tdx_hcall_set_notify_intr(u8 vector)
> +{
> +	/* Minimum vector value allowed is 32 */
> +	if (vector < 32)
> +		return -EINVAL;
> +
> +	/*
> +	 * Register callback vector address with VMM. More details
> +	 * about the ABI can be found in TDX Guest-Host-Communication
> +	 * Interface (GHCI), sec titled
> +	 * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> +	 */
> +	if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
> +		return -EIO;
> +
> +	return 0;
> +}
> +
>  static u64 get_cc_mask(void)
>  {
>  	struct tdx_module_output out;
> @@ -688,5 +755,11 @@ void __init tdx_early_init(void)
>  	x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
>  	x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
>  
> +	alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
> +			asm_sysvec_tdx_event_notify);
> +
> +	if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
> +		pr_warn("Setting event notification interrupt failed\n");
> +
>  	pr_info("Guest detected\n");
>  }
> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
> index 275e7fd20310..582deff56210 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -44,6 +44,9 @@ typedef struct {
>  	unsigned int irq_hv_reenlightenment_count;
>  	unsigned int hyperv_stimer0_count;
>  #endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +	unsigned int irq_tdx_event_notify_count;
> +#endif
>  } ____cacheline_aligned irq_cpustat_t;
>  
>  DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 72184b0b2219..655086dd940e 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
>  DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_asyncpf_interrupt);
>  #endif
>  
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR,	sysvec_tdx_event_notify);
> +#endif
> +
>  #undef X86_TRAP_OTHER
>  
>  #endif
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 43dcb9284208..82ac0c0a34b1 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -104,7 +104,12 @@
>  #define HYPERV_STIMER0_VECTOR		0xed
>  #endif
>  
> -#define LOCAL_TIMER_VECTOR		0xec
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +/* Vector on which TDX Guest event notification is delivered */
> +#define TDX_GUEST_EVENT_NOTIFY_VECTOR	0xec
> +#endif
> +
> +#define LOCAL_TIMER_VECTOR		0xeb
>  
>  #define NR_VECTORS			 256
>  
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 020c81a7c729..eb4db837cc44 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -67,6 +67,10 @@ void tdx_safe_halt(void);
>  
>  bool tdx_early_handle_ve(struct pt_regs *regs);
>  
> +void tdx_setup_ev_notify_handler(void (*handler)(void));
> +
> +void tdx_remove_ev_notify_handler(void);
> +
>  #else
>  
>  static inline void tdx_early_init(void) { };
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 766ffe3ba313..a96ecd866723 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
>  		seq_printf(p, "%10u ",
>  			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
>  	seq_puts(p, "  Posted-interrupt wakeup event\n");
> +#endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +	seq_printf(p, "%*s: ", prec, "TGN");
> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ",
> +			   irq_stats(j)->irq_tdx_event_notify_count);
> +	seq_puts(p, "  TDX Guest event notification\n");
>  #endif
>  	return 0;
>  }
> -- 
> 2.25.1
> 

Acked-by: Wander Lairson Costa <wander@redhat.com>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
  2022-05-02  2:40   ` Kai Huang
  2022-05-02  5:01   ` Kai Huang
@ 2022-05-02 13:16   ` Wander Lairson Costa
  2022-05-02 16:05     ` Sathyanarayanan Kuppuswamy
  2 siblings, 1 reply; 56+ messages in thread
From: Wander Lairson Costa @ 2022-05-02 13:16 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Sun, May 01, 2022 at 11:35:00AM -0700, Kuppuswamy Sathyanarayanan wrote:

[snip]

> +
> +static long tdx_get_quote(void __user *argp)
> +{
> +	struct tdx_quote_req quote_req;
> +	long ret = 0;
> +	int order;
> +
> +	/* Hold lock to serialize GetQuote requests */
> +	mutex_lock(&quote_lock);
> +
> +	reinit_completion(&req_compl);
> +
> +	/* Copy GetQuote request struct from user buffer */
> +	if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req))) {
> +		ret = -EFAULT;
> +		goto quote_failed;
> +	}
> +
> +	/* Make sure the length & timeout is valid */
> +	if (!quote_req.len || !quote_req.timeout) {
> +		ret = -EINVAL;
> +		goto quote_failed;
> +	}
> +
> +	/* Get order for Quote buffer page allocation */
> +	order = get_order(quote_req.len);
> +
> +	/*
> +	 * Allocate buffer to get TD Quote from the VMM.
> +	 * Size needs to be 4KB aligned (which is already
> +	 * met in page allocation).
> +	 */
> +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> +	if (!tdquote) {
> +		ret = -ENOMEM;
> +		goto quote_failed;
> +	}
> +
> +	/*
> +	 * Since this buffer will be shared with the VMM via GetQuote
> +	 * hypercall, decrypt it.
> +	 */
> +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> +	if (ret)
> +		goto quote_failed;
> +
> +	/* Copy TDREPORT from user buffer to kernel Quote buffer */
> +	if (copy_from_user(tdquote, (void __user *)quote_req.buf, quote_req.len)) {
> +		ret = -EFAULT;
> +		goto quote_failed;
> +	}
> +
> +	/* Submit GetQuote Request */
> +	ret = tdx_get_quote_hypercall(tdquote, (1ULL << order) * PAGE_SIZE);
> +	if (ret) {
> +		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
> +		ret = -EIO;
> +		goto quote_failed;
> +	}
> +
> +	/* Wait for attestation completion */
> +	ret = wait_for_completion_interruptible(&req_compl);
> +	if (ret <= 0) {
> +		ret = -EIO;
> +		goto quote_failed;
> +	}
> +
> +	/* Copy output data back to user buffer */
> +	if (copy_to_user((void __user *)quote_req.buf, tdquote, quote_req.len))
> +		ret = -EFAULT;
> +

IIUC, ret has a positive value at this point, due to the call to
wait_for_completion_interruptible, so we need add "ret = 0;"
here, don't we?

[snip]


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02  2:31   ` Kai Huang
@ 2022-05-02 15:52     ` Sathyanarayanan Kuppuswamy
  2022-05-02 22:30       ` Kai Huang
  0 siblings, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-02 15:52 UTC (permalink / raw)
  To: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Kai,

On 5/1/22 7:31 PM, Kai Huang wrote:
> On Sun, 2022-05-01 at 11:34 -0700, Kuppuswamy Sathyanarayanan wrote:
>> In TDX guest, attestation is used to verify the trustworthiness of a TD
>> to other entities before provisioning secrets to the TD.
>>     
>> One usage example is, when a TD guest uses encrypted drive and if the
>> decryption keys required to access the drive are stored in a secure 3rd
>> party keyserver, the key server can use attestation to verify TD's
>> trustworthiness and release the decryption keys to the TD.
>>
>> The attestation process consists of two steps: TDREPORT generation and
>> Quote generation.
>>
>> TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
>> the TDX module which contains TD-specific information (such as TD
>> measurements), platform security version, and the MAC to protect the
>> integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
>> get the TDREPORT from the TDX module. A user-provided 64-Byte
>> REPORTDATA is used as input and included in the TDREPORT. Typically it
>> can be some nonce provided by attestation service so the TDREPORT can
>> be verified uniquely. More details about TDREPORT can be found in
>> Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".
>>
>> After getting the TDREPORT, the second step of the attestation process
>> is to send the TDREPORT to Quoting Enclave (QE) or Quote Generation
>> Service (QGS) to generate the Quote. However the method of sending the
>> TDREPORT to QE/QGS, communication channel used and data format used is
>> specific to the implementation of QE/QGS.
>>
>> A typical implementation is, TD userspace attestation software gets the
>> TDREPORT from TD kernel, sends it to QE/QGS, and QE/QGS returns the
>> Quote. TD attestation software can use any available communication
>> channel to talk to QE/QGS, such as using vsock and tcp/ip.
>>
>> To support the case that those communication channels are not directly
>> available to the TD, TDX also defines TDVMCALLs to allow TD to ask VMM
>> to help with sending the TDREPORT and receiving the Quote. This support
>> is documented in the GHCI spec section titled "5.4 TD attestation".
> 
> Sorry I didn't think thoroughly in the reply to v4.  I think it's still
> necessary to mention TDREPORT can only be verified locally, because otherwise
> Quote isn't conceptually needed.  And above 3 paragraphs are too verbose I
> guess.  How about below:
> 
> "
> TDREPORT can only be verified locally as the MAC key is bound to the platform.
> TDX attestation leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT
> locally and convert it to a remote verifiable Quote to support remote
> attestation of the TDREPORT.
> 
> TD userspace attestation software firstly gets the TDREPORT from the TD kernel,
> and then sends it to the QE to generate a TD Quote.  The QE sends the Quote back
> after it is generated.  How is the data sent and received is QE implementation
> and deployment specific.  TD userspace attestation software can use whatever
> communication channel available (i.e. tcp/ip, vsock) to communicate with the QE
> using whatever data format.  TDX also defines TDVMCALLs to allow TD to ask VMM
> to facilitate sending/receiving data between TD attestation software and the QE.
> This support is documented in GHCI 1.0 spec "5.4 TD attestation".
> 
> Implement a basic attestation driver to ...

Final version looks like below:

In TDX guest, attestation is used to verify the trustworthiness of a TD
to other entities before provisioning secrets to the TD.

One usage example is, when a TD guest uses encrypted drive and if the
decryption keys required to access the drive are stored in a secure 3rd
party keyserver, the key server can use attestation to verify TD's
trustworthiness and release the decryption keys to the TD.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
the TDX module which contains TD-specific information (such as TD
measurements), platform security version, and the MAC to protect the
integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
get the TDREPORT from the TDX module. A user-provided 64-Byte
REPORTDATA is used as input and included in the TDREPORT. Typically it
can be some nonce provided by attestation service so the TDREPORT can
be verified uniquely. More details about TDREPORT can be found in
Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".

Also note that the MAC added to the TDREPORT is bound to the platform.
So TDREPORT can only be verified locally. Intel SGX Quote Enclave (QE)
can be leveraged to verify the TDREPORT locally and convert it to a
remote verifiable Quote to support remote attestation of the TDREPORT.

After getting the TDREPORT, the second step of the attestation process
is to send it to QE or Quote Generation Service (QGS) to generate a TD
Quote. The QE sends the Quote back after it is generated. How is the
data sent and received is QE implementation and deployment specific.TD
userspace attestation software can use whatever communication channel
available (i.e. tcp/ip, vsock) to communicate with the QE using whatever
data format. TDX also defines TDVMCALLs to allow TD to ask VMM to
facilitate sending/receiving data between TD attestation software and
the QE. This support is documented in GHCI 1.0 spec "5.4 TD attestation".

Implement a basic attestation driver to allow TD userspace to get the
TDREPORT, which is sent to QE by the attestation software to generate
a Quote for remote verification.

Also note that explicit access permissions are not enforced in this
driver because the quote and measurements are not a secret. However
the access permissions of the device node can be used to set any
desired access policy. The udev default is usually root access
only.

Operations like getting TDREPORT or Quote generation involves sending
a blob of data as input and getting another blob of data as output. It
was considered to use a sysfs interface for this, but it doesn't fit
well into the standard sysfs model for configuring values. It would be
possible to do read/write on files, but it would need multiple file
descriptors, which would be somewhat messy. IOCTLs seems to be the best
fitting and simplest model here.

> "
> 
>>
>> Implement a basic attestation driver to allow TD userspace to get the
>> TDREPORT, which is sent to QE by the attestation software to generate
>> a Quote for remote verification.
>>
>> Also note that explicit access permissions are not enforced in this
>> driver because the quote and measurements are not a secret. However
>> the access permissions of the device node can be used to set any
>> desired access policy. The udev default is usually root access
>> only.
> 
> The IOCTL vs /sysfs isn't discussed.

My previous versions had it. But I removed it in later revisions
because I thought the commit log is alredy long and IOCTL interface
justification is less important compared to other details.

> 
> For instance, after rough thinking, why is the IOCTL better than below approach
> using /sysfs?
> 
> echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
> cat /sys/kernel/coco/tdx/attest/tdreport
> 
> Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver to call
> TDCALL to get the TDREPORT, which is available at '/sys/.../tdreport'.
> 
> The benefit of using IOCTL I can think of now is it is perhaps more secure, as
> with IOCTL the REPORTDATA and the TDREPORT is visible to the process which calls
> the IOCTL, while using the /sysfs they are potentially visible to any process.
> Especially the REPORTDATA, i.e. it can come from attestation service after the
> TD attestation agent sets up a secure connection with it.
> 
> Anyway, my 2cents above.

IMO, since TDREPORT is not a secret we don't have to hightlight security
concern here. How about following?

Operations like getting TDREPORT or Quote generation involves sending
a blob of data as input and getting another blob of data as output. It
was considered to use a sysfs interface for this, but it doesn't fit
well into the standard sysfs model for configuring values. It would be
possible to do read/write on files, but it would need multiple file
descriptors, which would be somewhat messy. IOCTLs seems to be the best
fitting and simplest model here.



> 
>>
>> Reviewed-by: Tony Luck <tony.luck@intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> ---
>>   arch/x86/coco/tdx/Makefile      |   2 +-
>>   arch/x86/coco/tdx/attest.c      | 138 ++++++++++++++++++++++++++++++++
>>   arch/x86/include/uapi/asm/tdx.h |  36 +++++++++
>>   3 files changed, 175 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/x86/coco/tdx/attest.c
>>   create mode 100644 arch/x86/include/uapi/asm/tdx.h
>>
>> diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
>> index 46c55998557d..d2db3e6770e5 100644
>> --- a/arch/x86/coco/tdx/Makefile
>> +++ b/arch/x86/coco/tdx/Makefile
>> @@ -1,3 +1,3 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   
>> -obj-y += tdx.o tdcall.o
>> +obj-y += tdx.o tdcall.o attest.o
>> diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
>> new file mode 100644
>> index 000000000000..4543a0264ce7
>> --- /dev/null
>> +++ b/arch/x86/coco/tdx/attest.c
>> @@ -0,0 +1,138 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * attest.c - TDX guest attestation interface driver.
>> + *
>> + * Implements user interface to trigger attestation process.
>> + *
>> + * Copyright (C) 2022 Intel Corporation
>> + *
>> + */
>> +
>> +#define pr_fmt(fmt) "x86/tdx: attest: " fmt
>> +
>> +#include <linux/module.h>
> 
> Still need this?
> 
>> +#include <linux/miscdevice.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/fs.h>
>> +#include <linux/mm.h>
>> +#include <linux/slab.h>
>> +#include <linux/set_memory.h>
>> +#include <linux/dma-mapping.h>
> 
> Still need above two?
> 
>> +#include <linux/jiffies.h>
> 
> Don't need this either.
> 
>> +#include <linux/io.h>
>> +#include <asm/apic.h>
>> +#include <asm/tdx.h>
>> +#include <asm/irq_vectors.h>
>> +#include <uapi/asm/tdx.h>
> 
> Please get rid of unnecessary headers.

Ok. I will remove unused headers.

> 
>> +
>> +#define DRIVER_NAME "tdx-attest"
>> +
>> +/* TDREPORT module call leaf ID */
>> +#define TDX_GET_REPORT			4
>> +
>> +/* Upper 32 bits has the status code, so mask it */
>> +#define TDCALL_STATUS_CODE_MASK		0xffffffff00000000
>> +#define TDCALL_STATUS_CODE(a)		((a) & TDCALL_STATUS_CODE_MASK)
>> +
>> +static struct miscdevice miscdev;
>> +
>> +static long tdx_get_report(void __user *argp)
>> +{
>> +	void *reportdata = NULL, *tdreport = NULL;
>> +	long ret = 0;
>> +
>> +	/* Allocate buffer space for REPORTDATA */
>> +	reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
>> +	if (!reportdata)
>> +		return -ENOMEM;
>> +
>> +	/* Allocate buffer space for TDREPORT */
>> +	tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
>> +	if (!tdreport) {
>> +		ret = -ENOMEM;
>> +		goto failed;
> 
> Perhaps 'failed' -> 'out'.  That code is for both error and non-error case.

Ok.

> 
>> +	}
>> +
>> +	/* Copy REPORTDATA from the user buffer */
>> +	if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
>> +		ret = -EFAULT;
>> +		goto failed;
>> +	}
>> +
>> +	/*
>> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>> +	 *
>> +	 * Pass the physical address of user generated REPORTDATA
>> +	 * and the physical address of the output buffer to the TDX
>> +	 * module to generate the TDREPORT. Generated data contains
>> +	 * measurements/configuration data of the TD guest. More info
>> +	 * about ABI can be found in TDX 1.0 Module specification, sec
>> +	 * titled "TDG.MR.REPORT".
> 
> I guess you can get rid of the entire second paragraph.  If the reference to the
> spec is useful, then keep it but other sentences are not quite useful.  Perhaps:
> 
> 	Get the TDREPORT using REPORTDATA as input.  Refer to 22.3.3
> 	TDG.MR.REPORT leaf in the TDX Module 1.0 Specification for detail
> 	information.

How about following?

Pass REPORTDATA as input and generate TDREPORT using "TDG.MR.REPORT"
TDCALL. Refer to 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
Specification for detailed information.

> 
>> +	 */
>> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>> +				virt_to_phys(reportdata), 0, 0, NULL);
>> +	if (ret) {
>> +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
>> +				TDCALL_STATUS_CODE(ret));
> 
> You can just print out the exact error code.  It's more informative and can help
> to debug.

As per spec, only upper 32 bits has status code. 0:32 does not have any
useful info.

> 
>> +		ret = -EIO;
>> +		goto failed;
>> +	}
>> +
>> +	/* Copy TDREPORT back to the user buffer */
>> +	if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
>> +		ret = -EFAULT;
>> +
>> +failed:
>> +	kfree(reportdata);
>> +	kfree(tdreport);
>> +	return ret;
>> +}
>> +
>> +static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
>> +			     unsigned long arg)
>> +{
>> +	void __user *argp = (void __user *)arg;
>> +	long ret = 0;
> 
> If you initialize ret to -EINVAL here, then ...

Ok. I will initialize it to -EINVAL.

>> +
>> +	switch (cmd) {
>> +	case TDX_CMD_GET_REPORT:
>> +		ret = tdx_get_report(argp);
>> +		break;
>> +	default:
>> +		pr_debug("cmd %d not supported\n", cmd);
>> +		ret = -EINVAL;
> 
> You don't have to set it here.

Ok.

> 
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static const struct file_operations tdx_attest_fops = {
>> +	.owner		= THIS_MODULE,
>> +	.unlocked_ioctl	= tdx_attest_ioctl,
>> +	.llseek		= no_llseek,
>> +};
>> +
>> +static int __init tdx_attestation_init(void)
>> +{
>> +	long ret;
>> +
>> +	/* Make sure we are in a valid TDX platform */
>> +	if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>> +		return -EIO;
>> +
>> +	miscdev.name = DRIVER_NAME;
>> +	miscdev.minor = MISC_DYNAMIC_MINOR;
>> +	miscdev.fops = &tdx_attest_fops;
>> +
>> +	ret = misc_register(&miscdev);
>> +	if (ret) {
>> +		pr_err("misc device registration failed\n");
> 
> pr_debug() is used when __tdx_module_call() fails, and in the default case in
> tdx_attest_ioctl() too.
> 
> Shouldn't those error msg be printed using the same way?

For IOCTL case, I expect userspace to print the error. But for init
code error, it needs to be handled by kernel. So I have used pr_err
here.

> 
>> +		return ret;
>> +	}
>> +
>> +	pr_debug("module initialization success\n");
> 
> I don't think it's a module anymore?

Agree. I will remove the keyword module here.

> 
> Also perhaps just pr_info()?

Misc device creation itself is sufficient proof of successful
loading. So I will remove this debug message completely.

> 
>> +
>> +	return 0;
>> +}
>> +device_initcall(tdx_attestation_init)
>> diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
>> new file mode 100644
>> index 000000000000..9a7377723667
>> --- /dev/null
>> +++ b/arch/x86/include/uapi/asm/tdx.h
>> @@ -0,0 +1,36 @@
>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>> +#ifndef _UAPI_ASM_X86_TDX_H
>> +#define _UAPI_ASM_X86_TDX_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/ioctl.h>
>> +
>> +/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
>> +#define TDX_REPORTDATA_LEN		64
>> +
>> +/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
>> +#define TDX_REPORT_LEN			1024
> 
> I prefer TDX_TDREPORT_LEN.
> 
>> +
>> +/**
>> + * struct tdx_report_req: Get TDREPORT from the TDX module.
> 
> Just get the TDREPORT is enough I guess.

Get TDREPORT using REPORTDATA as input?

> 
>> + *
>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>> + *		 TDREPORT. Typically it can be some nonce provided by
>> + *		 attestation software so the generated TDREPORT can be
>> + *		 uniquely verified.
>> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>> + *		 TDX_REPORT_LEN.
>> + *
>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>> + */
>> +struct tdx_report_req {
>> +	union {
>> +		__u8 reportdata[TDX_REPORTDATA_LEN];
>> +		__u8 tdreport[TDX_REPORT_LEN];
>> +	};
>> +};
> 
> I am not sure overriding the input is a good idea, but will leave to others.

TDCALL uses it that way. So I have followed the same model.

> 
>> +
>> +/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT) */
> 
> Just get the TDREPORT is enough I guess.

May be following?

Get TDREPORT using TDCALL[TDG.MR.REPORT)

> 
>> +#define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
>> +
>> +#endif /* _UAPI_ASM_X86_TDX_H */
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-02  5:01   ` Kai Huang
@ 2022-05-02 16:02     ` Sathyanarayanan Kuppuswamy
  2022-05-02 23:02       ` Kai Huang
  0 siblings, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-02 16:02 UTC (permalink / raw)
  To: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Kai,

On 5/1/22 10:01 PM, Kai Huang wrote:
> 
>> +
>> +static long tdx_get_quote(void __user *argp)
>> +{
>> +	struct tdx_quote_req quote_req;
>> +	long ret = 0;
>> +	int order;
>> +
>> +	/* Hold lock to serialize GetQuote requests */
>> +	mutex_lock(&quote_lock);
>> +
>> +	reinit_completion(&req_compl);
>> +
>> +	/* Copy GetQuote request struct from user buffer */
>> +	if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req))) {
>> +		ret = -EFAULT;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/* Make sure the length & timeout is valid */
>> +	if (!quote_req.len || !quote_req.timeout) {
>> +		ret = -EINVAL;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/* Get order for Quote buffer page allocation */
>> +	order = get_order(quote_req.len);
>> +
>> +	/*
>> +	 * Allocate buffer to get TD Quote from the VMM.
>> +	 * Size needs to be 4KB aligned (which is already
>> +	 * met in page allocation).
>> +	 */
>> +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>> +	if (!tdquote) {
>> +		ret = -ENOMEM;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/*
>> +	 * Since this buffer will be shared with the VMM via GetQuote
>> +	 * hypercall, decrypt it.
>> +	 */
>> +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
>> +	if (ret)
>> +		goto quote_failed;
>> +
>> +	/* Copy TDREPORT from user buffer to kernel Quote buffer */
>> +	if (copy_from_user(tdquote, (void __user *)quote_req.buf, quote_req.len)) {
>> +		ret = -EFAULT;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/* Submit GetQuote Request */
>> +	ret = tdx_get_quote_hypercall(tdquote, (1ULL << order) * PAGE_SIZE);
>> +	if (ret) {
>> +		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
>> +		ret = -EIO;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/* Wait for attestation completion */
>> +	ret = wait_for_completion_interruptible(&req_compl);
>> +	if (ret <= 0) {
>> +		ret = -EIO;
>> +		goto quote_failed;
>> +	}
>> +
>> +	/* Copy output data back to user buffer */
>> +	if (copy_to_user((void __user *)quote_req.buf, tdquote, quote_req.len))
>> +		ret = -EFAULT;
>> +
>> +quote_failed:
>> +	if (tdquote)
>> +		free_pages((unsigned long)tdquote, order);
> 
> The buffer is freed w/o being converted back to private.  How can you prevent
> the buffer from being allocated by kernel and used as private pages again?

Yes. It needs set_memory_encrypted() call here. I will fix this in next
version.

> 
> Also, the  buffer may be still used by VMM when timeout (IN_FLIGHT), how can
> this even work?

We will never reach here for IN_FLIGHT case. We will block in
wait_for_completion_interruptible() till the status changes to success
or failure.

> 
>> +	tdquote = NULL;
>> +	mutex_unlock(&quote_lock);
>> +	return ret;
>> +}
>> +
>> +static void attestation_callback_handler(void)
>> +{
>> +	struct tdx_quote_hdr *quote_hdr;
>> +
>> +	quote_hdr = (struct tdx_quote_hdr *) tdquote;
>> +
>> +	/* Check for spurious callback IRQ case */
>> +	if (!tdquote || quote_hdr->status == GET_QUOTE_IN_FLIGHT)
>> +		return;
> 
> I don't get the logic.  Please explain.

I am trying to handle spurious IRQ case here. If we receive a callback
IRQ from VMM before even we allocate tdquote or post the GetQuote
request, accessing quote_hdr->status will lead to NULL pointer
exception. So I have added check for valid quote buffer (tdquote !=
NULL)

Second condition (quote_hdr->status == GET_QUOTE_IN_FLIGHT)) makes
sure we don't mark the current quote request complete until the
Quote buffer status changes to GET_QUOTE_SUCCESS, GET_QUOTE_ERROR or
GET_QUOTE_SERVICE_UNAVAILABLE.

> 
>> +
>> +	complete(&req_compl);
>> +}
>> +
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-02 13:16   ` Wander Lairson Costa
@ 2022-05-02 16:05     ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-02 16:05 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/2/22 6:16 AM, Wander Lairson Costa wrote:
> IIUC, ret has a positive value at this point, due to the call to
> wait_for_completion_interruptible, so we need add "ret = 0;"
> here, don't we?

It looks wait_for_completion_interruptible() return 0 on successful
completion. So I think I need to change ret <=0 to ret < 0.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02 12:18   ` Wander Lairson Costa
@ 2022-05-02 16:06     ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-02 16:06 UTC (permalink / raw)
  To: Wander Lairson Costa
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Kai Huang, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/2/22 5:18 AM, Wander Lairson Costa wrote:
>> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>> +				virt_to_phys(reportdata), 0, 0, NULL);
>> +	if (ret) {
>> +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
>> +				TDCALL_STATUS_CODE(ret));
> Should we use pr_err instead?

I expect user app will handle this error case and print debug info.
So we don't need to use pr_err.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02 15:52     ` Sathyanarayanan Kuppuswamy
@ 2022-05-02 22:30       ` Kai Huang
  2022-05-02 23:17         ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-02 22:30 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel


> 
> Also note that the MAC added to the TDREPORT is bound to the platform.
> So TDREPORT can only be verified locally. Intel SGX Quote Enclave (QE)
> can be leveraged to verify the TDREPORT locally and convert it to a
> remote verifiable Quote to support remote attestation of the TDREPORT.

Why "can be"?  TDX must use QE to generate the Quote.

[...]

> 
> > 
> > For instance, after rough thinking, why is the IOCTL better than below approach
> > using /sysfs?
> > 
> > echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
> > cat /sys/kernel/coco/tdx/attest/tdreport
> > 
> > Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver to call
> > TDCALL to get the TDREPORT, which is available at '/sys/.../tdreport'.
> > 
> > The benefit of using IOCTL I can think of now is it is perhaps more secure, as
> > with IOCTL the REPORTDATA and the TDREPORT is visible to the process which calls
> > the IOCTL, while using the /sysfs they are potentially visible to any process.
> > Especially the REPORTDATA, i.e. it can come from attestation service after the
> > TD attestation agent sets up a secure connection with it.
> > 
> > Anyway, my 2cents above.
> 
> IMO, since TDREPORT is not a secret we don't have to hightlight security
> concern here. 
> 

Right the TDREPORT itself isn't a secret.  However my thinking is the REPORTDATA
might be.  It's typically provided by the attestation service to the TD so the
Quote can be verified for instance per-session or per-connect or whatever.  The
REPORTDATA is the only thing that can be used to prevent reply attack anyway. 
From this perspective, it is kinda secret.  However the TDREPORT can be captured
by untrusted software and used for reply attack if no crypto-protection is used
when it is sent to the QE, so I am not sure how bad can the reply attack cause.

> How about following?
> 
> Operations like getting TDREPORT or Quote generation involves sending
> a blob of data as input and getting another blob of data as output. It
> was considered to use a sysfs interface for this, but it doesn't fit
> well into the standard sysfs model for configuring values. It would be
> possible to do read/write on files, but it would need multiple file
> descriptors, which would be somewhat messy. IOCTLs seems to be the best
> fitting and simplest model here.
> 
> 

Let's forget about GetQuote now.  As you can see it has couple of problems.  

If we don't argue from security perspective, what's wrong with the approach
using /sysfs I mentioned above?

[...]

> > > +
> > > +	/*
> > > +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> > > +	 *
> > > +	 * Pass the physical address of user generated REPORTDATA
> > > +	 * and the physical address of the output buffer to the TDX
> > > +	 * module to generate the TDREPORT. Generated data contains
> > > +	 * measurements/configuration data of the TD guest. More info
> > > +	 * about ABI can be found in TDX 1.0 Module specification, sec
> > > +	 * titled "TDG.MR.REPORT".
> > 
> > I guess you can get rid of the entire second paragraph.  If the reference to the
> > spec is useful, then keep it but other sentences are not quite useful.  Perhaps:
> > 
> > 	Get the TDREPORT using REPORTDATA as input.  Refer to 22.3.3
> > 	TDG.MR.REPORT leaf in the TDX Module 1.0 Specification for detail
> > 	information.
> 
> How about following?
> 
> Pass REPORTDATA as input and generate TDREPORT using "TDG.MR.REPORT"
> TDCALL. Refer to 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
> Specification for detailed information.

No problem.

> 
> > 
> > > +	 */
> > > +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> > > +				virt_to_phys(reportdata), 0, 0, NULL);
> > > +	if (ret) {
> > > +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
> > > +				TDCALL_STATUS_CODE(ret));
> > 
> > You can just print out the exact error code.  It's more informative and can help
> > to debug.
> 
> As per spec, only upper 32 bits has status code. 0:32 does not have any
> useful info.

Bits 0:31 are also defined by TDX error codes.  For instance, it also prints
which argument caused this error in case of OPERAND_INVALID.  Why is it not
useful?

[...]

> > > +	ret = misc_register(&miscdev);
> > > +	if (ret) {
> > > +		pr_err("misc device registration failed\n");
> > 
> > pr_debug() is used when __tdx_module_call() fails, and in the default case in
> > tdx_attest_ioctl() too.
> > 
> > Shouldn't those error msg be printed using the same way?
> 
> For IOCTL case, I expect userspace to print the error. But for init
> code error, it needs to be handled by kernel. So I have used pr_err
> here.

I don't quite get.  Why "userspace will print the error or not" has anything to
do with using pr_debug() vs pr_err() here?

[...]

> 
> > > +
> > > +/**
> > > + * struct tdx_report_req: Get TDREPORT from the TDX module.
> > 
> > Just get the TDREPORT is enough I guess.
> 
> Get TDREPORT using REPORTDATA as input?

No problem.

> 
> > 
> > > + *
> > > + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
> > > + *		 TDREPORT. Typically it can be some nonce provided by
> > > + *		 attestation software so the generated TDREPORT can be
> > > + *		 uniquely verified.
> > > + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
> > > + *		 TDX_REPORT_LEN.
> > > + *
> > > + * Used in TDX_CMD_GET_REPORT IOCTL request.
> > > + */
> > > +struct tdx_report_req {
> > > +	union {
> > > +		__u8 reportdata[TDX_REPORTDATA_LEN];
> > > +		__u8 tdreport[TDX_REPORT_LEN];
> > > +	};
> > > +};
> > 
> > I am not sure overriding the input is a good idea, but will leave to others.
> 
> TDCALL uses it that way. So I have followed the same model.

Which TDCALL?

And TDCALL is kernel internal implementation, but we are talking about userspace
ABI here.  I don't see any connection between them.

> 
> > 
> > > +
> > > +/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT) */
> > 
> > Just get the TDREPORT is enough I guess.
> 
> May be following?
> 
> Get TDREPORT using TDCALL[TDG.MR.REPORT)

My thinking is you don't need to call out the exact TDCALL in the uapi header. 
But no opinion here.  Will leave to maintainers.

> 
> > 
> > > +#define TDX_CMD_GET_REPORT		_IOWR('T', 0x01, struct tdx_report_req)
> > > +
> > > +#endif /* _UAPI_ASM_X86_TDX_H */
> > 
> 


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-02 16:02     ` Sathyanarayanan Kuppuswamy
@ 2022-05-02 23:02       ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-02 23:02 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

> 
> > 
> > Also, the  buffer may be still used by VMM when timeout (IN_FLIGHT), how can
> > this even work?
> 
> We will never reach here for IN_FLIGHT case. We will block in
> wait_for_completion_interruptible() till the status changes to success
> or failure.

But you still have 'timeout' in userspace ABI?

> 
> > 
> > > +	tdquote = NULL;
> > > +	mutex_unlock(&quote_lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static void attestation_callback_handler(void)
> > > +{
> > > +	struct tdx_quote_hdr *quote_hdr;
> > > +
> > > +	quote_hdr = (struct tdx_quote_hdr *) tdquote;
> > > +
> > > +	/* Check for spurious callback IRQ case */
> > > +	if (!tdquote || quote_hdr->status == GET_QUOTE_IN_FLIGHT)
> > > +		return;
> > 
> > I don't get the logic.  Please explain.
> 
> I am trying to handle spurious IRQ case here. If we receive a callback
> IRQ from VMM before even we allocate tdquote or post the GetQuote
> request, accessing quote_hdr->status will lead to NULL pointer
> exception. So I have added check for valid quote buffer (tdquote !=
> NULL)

tdquote here isn't protected mutex, so even after you check tdquote != NULL, you
cannot guarantee it is still valid when you check quote_hdr. 

For instance, you receive two interrupts in very short time.  The first
interrupt gets you out from wait_for_completion_interruptible().  Then second
interrupt comes, and you can potentially have situation like below:

	cpu 0					cpu 1

	tdx_get_quote				attestation_callback_hander
						
						!tdquote
		...
		tdquote = NULL;

						quote_hdr->status == ...


And the 'quote_hdr' is even initialized at the beginning, and when it is used,
the actual tdquote may already been freed.

Or did I get it wrong?

> 
> Second condition (quote_hdr->status == GET_QUOTE_IN_FLIGHT)) makes
> sure we don't mark the current quote request complete until the
> Quote buffer status changes to GET_QUOTE_SUCCESS, GET_QUOTE_ERROR or
> GET_QUOTE_SERVICE_UNAVAILABLE.

I thought the GHCI spec says VMM will always inject intererupt after the Quote
is ready, so we can have assumption it won't inject when buffer is still
IN_FLIGHT.  But I am also not so sure.


Anyway, to me it seems there's ambiguities around GetQuote and
SetupEventNotifyInterrupt.  This is the reason that I suggested  you to split
the basic driver out and you can even upstream it first w/o GetQuote support.  I
think you may want to actually start to consider to upstream the basic driver
first.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02 22:30       ` Kai Huang
@ 2022-05-02 23:17         ` Sathyanarayanan Kuppuswamy
  2022-05-02 23:37           ` Kai Huang
  2022-05-03 14:38           ` Wander Costa
  0 siblings, 2 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-02 23:17 UTC (permalink / raw)
  To: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/2/22 3:30 PM, Kai Huang wrote:
> 
>>
>> Also note that the MAC added to the TDREPORT is bound to the platform.
>> So TDREPORT can only be verified locally. Intel SGX Quote Enclave (QE)
>> can be leveraged to verify the TDREPORT locally and convert it to a
>> remote verifiable Quote to support remote attestation of the TDREPORT.
> 
> Why "can be"?  TDX must use QE to generate the Quote.
> 
> [...]

Ok. Will change it.

> 
>>
>>>
>>> For instance, after rough thinking, why is the IOCTL better than below approach
>>> using /sysfs?
>>>
>>> echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
>>> cat /sys/kernel/coco/tdx/attest/tdreport
>>>
>>> Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver to call
>>> TDCALL to get the TDREPORT, which is available at '/sys/.../tdreport'.
>>>
>>> The benefit of using IOCTL I can think of now is it is perhaps more secure, as
>>> with IOCTL the REPORTDATA and the TDREPORT is visible to the process which calls
>>> the IOCTL, while using the /sysfs they are potentially visible to any process.
>>> Especially the REPORTDATA, i.e. it can come from attestation service after the
>>> TD attestation agent sets up a secure connection with it.
>>>
>>> Anyway, my 2cents above.
>>
>> IMO, since TDREPORT is not a secret we don't have to hightlight security
>> concern here.
>>
> 
> Right the TDREPORT itself isn't a secret.  However my thinking is the REPORTDATA
> might be.  It's typically provided by the attestation service to the TD so the
> Quote can be verified for instance per-session or per-connect or whatever.  The
> REPORTDATA is the only thing that can be used to prevent reply attack anyway.
>  From this perspective, it is kinda secret.  However the TDREPORT can be captured
> by untrusted software and used for reply attack if no crypto-protection is used
> when it is sent to the QE, so I am not sure how bad can the reply attack cause.
> 
>> How about following?
>>
>> Operations like getting TDREPORT or Quote generation involves sending
>> a blob of data as input and getting another blob of data as output. It
>> was considered to use a sysfs interface for this, but it doesn't fit
>> well into the standard sysfs model for configuring values. It would be
>> possible to do read/write on files, but it would need multiple file
>> descriptors, which would be somewhat messy. IOCTLs seems to be the best
>> fitting and simplest model here.
>>
>>
> 
> Let's forget about GetQuote now.  As you can see it has couple of problems.
> 
> If we don't argue from security perspective, what's wrong with the approach
> using /sysfs I mentioned above?

Sysfs is generally used for configuring values. As I have mentioned, it
does not fit well for our use case where we want to send blob of data
as input and get another blob as an output. But, if you really want to
use it, you can implement it like you have mentioned (with multiple
files). But I don't see any advantage in doing it.

Also, in the future, if you want to support multiple requests in
parallel, sysfs model will not work.

> 
> [...]
> 
>>>> +
>>>> +	/*
>>>> +	 * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>>>> +	 *
>>>> +	 * Pass the physical address of user generated REPORTDATA
>>>> +	 * and the physical address of the output buffer to the TDX
>>>> +	 * module to generate the TDREPORT. Generated data contains
>>>> +	 * measurements/configuration data of the TD guest. More info
>>>> +	 * about ABI can be found in TDX 1.0 Module specification, sec
>>>> +	 * titled "TDG.MR.REPORT".
>>>
>>> I guess you can get rid of the entire second paragraph.  If the reference to the
>>> spec is useful, then keep it but other sentences are not quite useful.  Perhaps:
>>>
>>> 	Get the TDREPORT using REPORTDATA as input.  Refer to 22.3.3
>>> 	TDG.MR.REPORT leaf in the TDX Module 1.0 Specification for detail
>>> 	information.
>>
>> How about following?
>>
>> Pass REPORTDATA as input and generate TDREPORT using "TDG.MR.REPORT"
>> TDCALL. Refer to 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>> Specification for detailed information.
> 
> No problem.
> 
>>
>>>
>>>> +	 */
>>>> +	ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>>>> +				virt_to_phys(reportdata), 0, 0, NULL);
>>>> +	if (ret) {
>>>> +		pr_debug("TDREPORT TDCALL failed, status:%lx\n",
>>>> +				TDCALL_STATUS_CODE(ret));
>>>
>>> You can just print out the exact error code.  It's more informative and can help
>>> to debug.
>>
>> As per spec, only upper 32 bits has status code. 0:32 does not have any
>> useful info.
> 
> Bits 0:31 are also defined by TDX error codes.  For instance, it also prints
> which argument caused this error in case of OPERAND_INVALID.  Why is it not
> useful?
> 
> [...]

Agree with your point. I will remove the mask.

> 
>>>> +	ret = misc_register(&miscdev);
>>>> +	if (ret) {
>>>> +		pr_err("misc device registration failed\n");
>>>
>>> pr_debug() is used when __tdx_module_call() fails, and in the default case in
>>> tdx_attest_ioctl() too.
>>>
>>> Shouldn't those error msg be printed using the same way?
>>
>> For IOCTL case, I expect userspace to print the error. But for init
>> code error, it needs to be handled by kernel. So I have used pr_err
>> here.
> 
> I don't quite get.  Why "userspace will print the error or not" has anything to
> do with using pr_debug() vs pr_err() here?
> 
> [...]

I don't want to pollute the dmesg logs if possible. For IOCTL use case, 
the return value can be used to understand the failure reason from user
code. But for initcall failure, pr_err message is required to understand
the failure reason.


> 
>>
>>>> +
>>>> +/**
>>>> + * struct tdx_report_req: Get TDREPORT from the TDX module.
>>>
>>> Just get the TDREPORT is enough I guess.
>>
>> Get TDREPORT using REPORTDATA as input?
> 
> No problem.
> 
>>
>>>
>>>> + *
>>>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>>>> + *		 TDREPORT. Typically it can be some nonce provided by
>>>> + *		 attestation software so the generated TDREPORT can be
>>>> + *		 uniquely verified.
>>>> + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>>>> + *		 TDX_REPORT_LEN.
>>>> + *
>>>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>>>> + */
>>>> +struct tdx_report_req {
>>>> +	union {
>>>> +		__u8 reportdata[TDX_REPORTDATA_LEN];
>>>> +		__u8 tdreport[TDX_REPORT_LEN];
>>>> +	};
>>>> +};
>>>
>>> I am not sure overriding the input is a good idea, but will leave to others.
>>
>> TDCALL uses it that way. So I have followed the same model.
> 
> Which TDCALL?

TDG.MR.REPORT. It uses the same buffer as input and output.


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02 23:17         ` Sathyanarayanan Kuppuswamy
@ 2022-05-02 23:37           ` Kai Huang
  2022-05-03 14:38           ` Wander Costa
  1 sibling, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-02 23:37 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

> > > 
> > > > 
> > > > For instance, after rough thinking, why is the IOCTL better than below approach
> > > > using /sysfs?
> > > > 
> > > > echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
> > > > cat /sys/kernel/coco/tdx/attest/tdreport
> > > > 
> > > > Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver to call
> > > > TDCALL to get the TDREPORT, which is available at '/sys/.../tdreport'.
> > > > 
> > > > The benefit of using IOCTL I can think of now is it is perhaps more secure, as
> > > > with IOCTL the REPORTDATA and the TDREPORT is visible to the process which calls
> > > > the IOCTL, while using the /sysfs they are potentially visible to any process.
> > > > Especially the REPORTDATA, i.e. it can come from attestation service after the
> > > > TD attestation agent sets up a secure connection with it.
> > > > 
> > > > Anyway, my 2cents above.
> > > 
> > > IMO, since TDREPORT is not a secret we don't have to hightlight security
> > > concern here.
> > > 
> > 
> > Right the TDREPORT itself isn't a secret.  However my thinking is the REPORTDATA
> > might be.  It's typically provided by the attestation service to the TD so the
> > Quote can be verified for instance per-session or per-connect or whatever.  The
> > REPORTDATA is the only thing that can be used to prevent reply attack anyway.
> >  From this perspective, it is kinda secret.  However the TDREPORT can be captured
> > by untrusted software and used for reply attack if no crypto-protection is used
> > when it is sent to the QE, so I am not sure how bad can the reply attack cause.
> > 
> > > How about following?
> > > 
> > > Operations like getting TDREPORT or Quote generation involves sending
> > > a blob of data as input and getting another blob of data as output. It
> > > was considered to use a sysfs interface for this, but it doesn't fit
> > > well into the standard sysfs model for configuring values. It would be
> > > possible to do read/write on files, but it would need multiple file
> > > descriptors, which would be somewhat messy. IOCTLs seems to be the best
> > > fitting and simplest model here.
> > > 
> > > 
> > 
> > Let's forget about GetQuote now.  As you can see it has couple of problems.
> > 
> > If we don't argue from security perspective, what's wrong with the approach
> > using /sysfs I mentioned above?
> 
> Sysfs is generally used for configuring values. As I have mentioned, it
> does not fit well for our use case where we want to send blob of data
> as input and get another blob as an output. But, if you really want to
> use it, you can implement it like you have mentioned (with multiple
> files). But I don't see any advantage in doing it.

I am not so sure about this argument.  Will leave this to maintainers.

My point is to me from security's perspective, IOCTL fits better.  And you can
add this to your changelog.

> 
> Also, in the future, if you want to support multiple requests in
> parallel, sysfs model will not work.
> 
> 
> > > 

I think you are mixing getting TDREPORT and GetQuote.  What's the point of
supporting get TDREPORT in parallel?  You can only get one TDREPORT from TDX
module at one time.

> > > > 
> > > > > + *
> > > > > + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
> > > > > + *		 TDREPORT. Typically it can be some nonce provided by
> > > > > + *		 attestation software so the generated TDREPORT can be
> > > > > + *		 uniquely verified.
> > > > > + * @tdreport   : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
> > > > > + *		 TDX_REPORT_LEN.
> > > > > + *
> > > > > + * Used in TDX_CMD_GET_REPORT IOCTL request.
> > > > > + */
> > > > > +struct tdx_report_req {
> > > > > +	union {
> > > > > +		__u8 reportdata[TDX_REPORTDATA_LEN];
> > > > > +		__u8 tdreport[TDX_REPORT_LEN];
> > > > > +	};
> > > > > +};
> > > > 
> > > > I am not sure overriding the input is a good idea, but will leave to others.
> > > 
> > > TDCALL uses it that way. So I have followed the same model.
> > 
> > Which TDCALL?
> 
> TDG.MR.REPORT. It uses the same buffer as input and output.
> 
> 
It doesn't override the REPORTDATA:

Input:

RCX:	1024B-aligned guest physical address of newly created report structure
RDX:	64B-aligned guest physical address of additional data to be signed

As you can see the buffer to hold TDREPORT and REPORTDATA are not the same.

Anyway, we are talking about userspace ABI.  I don't see why this is related
here.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-02  2:40   ` Kai Huang
@ 2022-05-03  1:27     ` Kirill A. Shutemov
  2022-05-03  2:18       ` Kai Huang
  2022-05-03 22:24       ` Dave Hansen
  0 siblings, 2 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-03  1:27 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Mon, May 02, 2022 at 02:40:26PM +1200, Kai Huang wrote:
> 
> > +
> > +	/* Get order for Quote buffer page allocation */
> > +	order = get_order(quote_req.len);
> > +
> > +	/*
> > +	 * Allocate buffer to get TD Quote from the VMM.
> > +	 * Size needs to be 4KB aligned (which is already
> > +	 * met in page allocation).
> > +	 */
> > +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > +	if (!tdquote) {
> > +		ret = -ENOMEM;
> > +		goto quote_failed;
> > +	}
> 
> You can use alloc_pages_exact().
> 
> > +
> > +	/*
> > +	 * Since this buffer will be shared with the VMM via GetQuote
> > +	 * hypercall, decrypt it.
> > +	 */
> > +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> > +	if (ret)
> > +		goto quote_failed;
> 
> 
> Again, Dave and Andi already commented you should use vmap() to avoid breaking
> up the direct-mapping.  Please use vmap() instead.
> 
> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
> 
> Will review the rest later.

I would rather convert it to use DMA API for memory allocation. It will
tap into swiotlb buffer that already converted and there's no need to
touch direct mapping. Both allocation and freeing such memory is cheaper
because of that.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  1:27     ` Kirill A. Shutemov
@ 2022-05-03  2:18       ` Kai Huang
  2022-05-03  2:39         ` Sathyanarayanan Kuppuswamy
  2022-05-03  2:45         ` Kirill A. Shutemov
  2022-05-03 22:24       ` Dave Hansen
  1 sibling, 2 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-03  2:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, 2022-05-03 at 04:27 +0300, Kirill A. Shutemov wrote:
> On Mon, May 02, 2022 at 02:40:26PM +1200, Kai Huang wrote:
> > 
> > > +
> > > +	/* Get order for Quote buffer page allocation */
> > > +	order = get_order(quote_req.len);
> > > +
> > > +	/*
> > > +	 * Allocate buffer to get TD Quote from the VMM.
> > > +	 * Size needs to be 4KB aligned (which is already
> > > +	 * met in page allocation).
> > > +	 */
> > > +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > > +	if (!tdquote) {
> > > +		ret = -ENOMEM;
> > > +		goto quote_failed;
> > > +	}
> > 
> > You can use alloc_pages_exact().
> > 
> > > +
> > > +	/*
> > > +	 * Since this buffer will be shared with the VMM via GetQuote
> > > +	 * hypercall, decrypt it.
> > > +	 */
> > > +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> > > +	if (ret)
> > > +		goto quote_failed;
> > 
> > 
> > Again, Dave and Andi already commented you should use vmap() to avoid breaking
> > up the direct-mapping.  Please use vmap() instead.
> > 
> > https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
> > 
> > Will review the rest later.
> 
> I would rather convert it to use DMA API for memory allocation. It will
> tap into swiotlb buffer that already converted and there's no need to
> touch direct mapping. Both allocation and freeing such memory is cheaper
> because of that.
> 

Does each DMA allocation and free internally do the actual private/shared
conversion?  Or the swiotlb is converted at the beginning at boot and DMA
allocation will always get the shared buffer automatically?

The problem of using DMA API is it will need to bring additional code to use
platform device, which isn't necessary.

Using vmap() we can still (almost) avoid private/shared conversion at IOCTL time
by allocating a default size buffer (which is large enough to cover 99% cases,
etc) at driver initialization time:

https://lore.kernel.org/lkml/20220422233418.1203092-2-sathyanarayanan.kuppuswamy@linux.intel.com/T/#maf7e5f6894548972c5de71f607199a79645856ff


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  2:18       ` Kai Huang
@ 2022-05-03  2:39         ` Sathyanarayanan Kuppuswamy
  2022-05-03 22:13           ` Kai Huang
  2022-05-03  2:45         ` Kirill A. Shutemov
  1 sibling, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-03  2:39 UTC (permalink / raw)
  To: Kai Huang, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/2/22 7:18 PM, Kai Huang wrote:
> On Tue, 2022-05-03 at 04:27 +0300, Kirill A. Shutemov wrote:
>> On Mon, May 02, 2022 at 02:40:26PM +1200, Kai Huang wrote:
>>>
>>>> +
>>>> +	/* Get order for Quote buffer page allocation */
>>>> +	order = get_order(quote_req.len);
>>>> +
>>>> +	/*
>>>> +	 * Allocate buffer to get TD Quote from the VMM.
>>>> +	 * Size needs to be 4KB aligned (which is already
>>>> +	 * met in page allocation).
>>>> +	 */
>>>> +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
>>>> +	if (!tdquote) {
>>>> +		ret = -ENOMEM;
>>>> +		goto quote_failed;
>>>> +	}
>>>
>>> You can use alloc_pages_exact().
>>>
>>>> +
>>>> +	/*
>>>> +	 * Since this buffer will be shared with the VMM via GetQuote
>>>> +	 * hypercall, decrypt it.
>>>> +	 */
>>>> +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
>>>> +	if (ret)
>>>> +		goto quote_failed;
>>>
>>>
>>> Again, Dave and Andi already commented you should use vmap() to avoid breaking
>>> up the direct-mapping.  Please use vmap() instead.
>>>
>>> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
>>>
>>> Will review the rest later.
>>
>> I would rather convert it to use DMA API for memory allocation. It will
>> tap into swiotlb buffer that already converted and there's no need to
>> touch direct mapping. Both allocation and freeing such memory is cheaper
>> because of that.
>>
> 
> Does each DMA allocation and free internally do the actual private/shared
> conversion?  Or the swiotlb is converted at the beginning at boot and DMA
> allocation will always get the shared buffer automatically?

DMA allocation will always return shared buffer.

> 
> The problem of using DMA API is it will need to bring additional code to use
> platform device, which isn't necessary.

Yes.

> 
> Using vmap() we can still (almost) avoid private/shared conversion at IOCTL time
> by allocating a default size buffer (which is large enough to cover 99% cases,
> etc) at driver initialization time:

Allocating fixed size buffer pool will work for dma buffer allocation
as well.

So the comparison is between platform driver boilerplate code vs vmap
and shared/unshared code addition. It is arguable which is better. I
think it is about preference.

> 
> https://lore.kernel.org/lkml/20220422233418.1203092-2-sathyanarayanan.kuppuswamy@linux.intel.com/T/#maf7e5f6894548972c5de71f607199a79645856ff
> 
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  2:18       ` Kai Huang
  2022-05-03  2:39         ` Sathyanarayanan Kuppuswamy
@ 2022-05-03  2:45         ` Kirill A. Shutemov
  2022-05-03  3:36           ` Kai Huang
  1 sibling, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-03  2:45 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, May 03, 2022 at 02:18:10PM +1200, Kai Huang wrote:
> On Tue, 2022-05-03 at 04:27 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 02, 2022 at 02:40:26PM +1200, Kai Huang wrote:
> > > 
> > > > +
> > > > +	/* Get order for Quote buffer page allocation */
> > > > +	order = get_order(quote_req.len);
> > > > +
> > > > +	/*
> > > > +	 * Allocate buffer to get TD Quote from the VMM.
> > > > +	 * Size needs to be 4KB aligned (which is already
> > > > +	 * met in page allocation).
> > > > +	 */
> > > > +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > > > +	if (!tdquote) {
> > > > +		ret = -ENOMEM;
> > > > +		goto quote_failed;
> > > > +	}
> > > 
> > > You can use alloc_pages_exact().
> > > 
> > > > +
> > > > +	/*
> > > > +	 * Since this buffer will be shared with the VMM via GetQuote
> > > > +	 * hypercall, decrypt it.
> > > > +	 */
> > > > +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> > > > +	if (ret)
> > > > +		goto quote_failed;
> > > 
> > > 
> > > Again, Dave and Andi already commented you should use vmap() to avoid breaking
> > > up the direct-mapping.  Please use vmap() instead.
> > > 
> > > https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
> > > 
> > > Will review the rest later.
> > 
> > I would rather convert it to use DMA API for memory allocation. It will
> > tap into swiotlb buffer that already converted and there's no need to
> > touch direct mapping. Both allocation and freeing such memory is cheaper
> > because of that.
> > 
> 
> Does each DMA allocation and free internally do the actual private/shared
> conversion?  Or the swiotlb is converted at the beginning at boot and DMA
> allocation will always get the shared buffer automatically?

It can remap as fallback, but usually it allocates from the pool.

> The problem of using DMA API is it will need to bring additional code to use
> platform device, which isn't necessary.

Heh? DMA is in the kernel anyway. Or do you mean some cost from the header
for the compilation unit? That's strange argument. Kernel provides
infrastructure for a reason.

> Using vmap() we can still (almost) avoid private/shared conversion at IOCTL time
> by allocating a default size buffer (which is large enough to cover 99% cases,
> etc) at driver initialization time:
> 
> https://lore.kernel.org/lkml/20220422233418.1203092-2-sathyanarayanan.kuppuswamy@linux.intel.com/T/#maf7e5f6894548972c5de71f607199a79645856ff

I don't see a reason to invent ad-hoc solution if there's an establised
API for the task.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  2:45         ` Kirill A. Shutemov
@ 2022-05-03  3:36           ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-03  3:36 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, 2022-05-03 at 05:45 +0300, Kirill A. Shutemov wrote:
> On Tue, May 03, 2022 at 02:18:10PM +1200, Kai Huang wrote:
> > On Tue, 2022-05-03 at 04:27 +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 02, 2022 at 02:40:26PM +1200, Kai Huang wrote:
> > > > 
> > > > > +
> > > > > +	/* Get order for Quote buffer page allocation */
> > > > > +	order = get_order(quote_req.len);
> > > > > +
> > > > > +	/*
> > > > > +	 * Allocate buffer to get TD Quote from the VMM.
> > > > > +	 * Size needs to be 4KB aligned (which is already
> > > > > +	 * met in page allocation).
> > > > > +	 */
> > > > > +	tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
> > > > > +	if (!tdquote) {
> > > > > +		ret = -ENOMEM;
> > > > > +		goto quote_failed;
> > > > > +	}
> > > > 
> > > > You can use alloc_pages_exact().
> > > > 
> > > > > +
> > > > > +	/*
> > > > > +	 * Since this buffer will be shared with the VMM via GetQuote
> > > > > +	 * hypercall, decrypt it.
> > > > > +	 */
> > > > > +	ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
> > > > > +	if (ret)
> > > > > +		goto quote_failed;
> > > > 
> > > > 
> > > > Again, Dave and Andi already commented you should use vmap() to avoid breaking
> > > > up the direct-mapping.  Please use vmap() instead.
> > > > 
> > > > https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
> > > > 
> > > > Will review the rest later.
> > > 
> > > I would rather convert it to use DMA API for memory allocation. It will
> > > tap into swiotlb buffer that already converted and there's no need to
> > > touch direct mapping. Both allocation and freeing such memory is cheaper
> > > because of that.
> > > 
> > 
> > Does each DMA allocation and free internally do the actual private/shared
> > conversion?  Or the swiotlb is converted at the beginning at boot and DMA
> > allocation will always get the shared buffer automatically?
> 
> It can remap as fallback, but usually it allocates from the pool.
> 
> > The problem of using DMA API is it will need to bring additional code to use
> > platform device, which isn't necessary.
> 
> Heh? DMA is in the kernel anyway. Or do you mean some cost from the header
> for the compilation unit? That's strange argument. Kernel provides
> infrastructure for a reason.

I mean using platform device is more complicated than using misc_register()
directly.  You can compare the code between v4 and v5.

> 
> > Using vmap() we can still (almost) avoid private/shared conversion at IOCTL time
> > by allocating a default size buffer (which is large enough to cover 99% cases,
> > etc) at driver initialization time:
> > 
> > https://lore.kernel.org/lkml/20220422233418.1203092-2-sathyanarayanan.kuppuswamy@linux.intel.com/T/#maf7e5f6894548972c5de71f607199a79645856ff
> 
> I don't see a reason to invent ad-hoc solution if there's an establised
> API for the task.
> 

DMA API can fit this job doesn't mean it fits better.  And I don't see why using
vmap() as I described above is a ad-hoc.

1) There's no real DMA involved in attestation.  Using DMA API is overkill.
2) DMA buffers are always shared, but this is only true for now.  It's not
guaranteed to be true for future generation of TDX.

It's just a little bit weird to use DMA API when there's no real device and no
real DMA is involved.  It's much more like using DMA API for convenience
purpose. 


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-02 23:17         ` Sathyanarayanan Kuppuswamy
  2022-05-02 23:37           ` Kai Huang
@ 2022-05-03 14:38           ` Wander Costa
  2022-05-03 15:09             ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 56+ messages in thread
From: Wander Costa @ 2022-05-03 14:38 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, open list

On Mon, May 2, 2022 at 8:17 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:

> > [...]
>
> I don't want to pollute the dmesg logs if possible. For IOCTL use case,
> the return value can be used to understand the failure reason from user
> code. But for initcall failure, pr_err message is required to understand
> the failure reason.

How often is this call expected to fail?


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-03 14:38           ` Wander Costa
@ 2022-05-03 15:09             ` Sathyanarayanan Kuppuswamy
  2022-05-03 22:08               ` Kai Huang
  0 siblings, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-03 15:09 UTC (permalink / raw)
  To: Wander Costa
  Cc: Kai Huang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H . Peter Anvin, Kirill A . Shutemov,
	Tony Luck, Andi Kleen, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, open list



On 5/3/22 7:38 AM, Wander Costa wrote:
>> I don't want to pollute the dmesg logs if possible. For IOCTL use case,
>> the return value can be used to understand the failure reason from user
>> code. But for initcall failure, pr_err message is required to understand
>> the failure reason.
> How often is this call expected to fail?

In general, it should not fail (so very low fail frequency). But the
point is, we can easily understand this failure from user end. So we
don't need to print more in non-debug environment.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver
  2022-05-03 15:09             ` Sathyanarayanan Kuppuswamy
@ 2022-05-03 22:08               ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-03 22:08 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Wander Costa
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, open list

On Tue, 2022-05-03 at 08:09 -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/3/22 7:38 AM, Wander Costa wrote:
> > > I don't want to pollute the dmesg logs if possible. For IOCTL use case,
> > > the return value can be used to understand the failure reason from user
> > > code. But for initcall failure, pr_err message is required to understand
> > > the failure reason.
> > How often is this call expected to fail?
> 
> In general, it should not fail (so very low fail frequency). But the
> point is, we can easily understand this failure from user end. So we
> don't need to print more in non-debug environment.
> 
> > 

To support your statement, all the error codes return to userspace need to be
clearly documented around the IOCTL in the uapi header.  But I think you have to
do this anyway.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  2:39         ` Sathyanarayanan Kuppuswamy
@ 2022-05-03 22:13           ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-03 22:13 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Mon, 2022-05-02 at 19:39 -0700, Sathyanarayanan Kuppuswamy wrote:
> > 
> > Using vmap() we can still (almost) avoid private/shared conversion at IOCTL
> > time
> > by allocating a default size buffer (which is large enough to cover 99%
> > cases,
> > etc) at driver initialization time:
> 
> Allocating fixed size buffer pool will work for dma buffer allocation
> as well.
> 
> So the comparison is between platform driver boilerplate code vs vmap
> and shared/unshared code addition. It is arguable which is better. I
> think it is about preference.

Not really.  DMA buffer is guaranteed to be shared for now, but it's not
guaranteed in the future generations of TDX.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03  1:27     ` Kirill A. Shutemov
  2022-05-03  2:18       ` Kai Huang
@ 2022-05-03 22:24       ` Dave Hansen
  2022-05-03 22:28         ` Sathyanarayanan Kuppuswamy
  2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
  1 sibling, 2 replies; 56+ messages in thread
From: Dave Hansen @ 2022-05-03 22:24 UTC (permalink / raw)
  To: Kirill A. Shutemov, Kai Huang
  Cc: Kuppuswamy Sathyanarayanan, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Kirill A . Shutemov, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On 5/2/22 18:27, Kirill A. Shutemov wrote:
>> Again, Dave and Andi already commented you should use vmap() to avoid breaking
>> up the direct-mapping.  Please use vmap() instead.
>>
>> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
>>
>> Will review the rest later.
> I would rather convert it to use DMA API for memory allocation. It will
> tap into swiotlb buffer that already converted and there's no need to
> touch direct mapping. Both allocation and freeing such memory is cheaper
> because of that.

Sathya, I don't quite understand why you are so forcefully declining to
incorporate review feedback on this point.  I gave very specific
feedback about the kind of mapping you need and that you should avoid
fragmenting the direct map if at all possible.

Why is this code still fragmenting the direct map?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03 22:24       ` Dave Hansen
@ 2022-05-03 22:28         ` Sathyanarayanan Kuppuswamy
  2022-05-03 22:30           ` Sathyanarayanan Kuppuswamy
  2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-03 22:28 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Kai Huang
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/3/22 3:24 PM, Dave Hansen wrote:
> On 5/2/22 18:27, Kirill A. Shutemov wrote:
>>> Again, Dave and Andi already commented you should use vmap() to avoid breaking
>>> up the direct-mapping.  Please use vmap() instead.
>>>
>>> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
>>>
>>> Will review the rest later.
>> I would rather convert it to use DMA API for memory allocation. It will
>> tap into swiotlb buffer that already converted and there's no need to
>> touch direct mapping. Both allocation and freeing such memory is cheaper
>> because of that.
> 
> Sathya, I don't quite understand why you are so forcefully declining to
> incorporate review feedback on this point.  I gave very specific
> feedback about the kind of mapping you need and that you should avoid
> fragmenting the direct map if at all possible.
> 
> Why is this code still fragmenting the direct map?

I have already implemented it and testing it now.

In this discussion, we are comparing the use of DMA API for memory
allocation vs vmap/sharing it in driver itself.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03 22:28         ` Sathyanarayanan Kuppuswamy
@ 2022-05-03 22:30           ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-03 22:30 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Kai Huang
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/3/22 3:28 PM, Sathyanarayanan Kuppuswamy wrote:
> 
> 
> On 5/3/22 3:24 PM, Dave Hansen wrote:
>> On 5/2/22 18:27, Kirill A. Shutemov wrote:
>>>> Again, Dave and Andi already commented you should use vmap() to 
>>>> avoid breaking
>>>> up the direct-mapping.  Please use vmap() instead.
>>>>
>>>> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/ 
>>>>
>>>>
>>>> Will review the rest later.
>>> I would rather convert it to use DMA API for memory allocation. It will
>>> tap into swiotlb buffer that already converted and there's no need to
>>> touch direct mapping. Both allocation and freeing such memory is cheaper
>>> because of that.
>>
>> Sathya, I don't quite understand why you are so forcefully declining to
>> incorporate review feedback on this point.  I gave very specific
>> feedback about the kind of mapping you need and that you should avoid
>> fragmenting the direct map if at all possible.
>>
>> Why is this code still fragmenting the direct map?
> 
> I have already implemented it and testing it now.

I mean, I have already implemented the vmap based solution.

> 
> In this discussion, we are comparing the use of DMA API for memory
> allocation vs vmap/sharing it in driver itself.
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-03 22:24       ` Dave Hansen
  2022-05-03 22:28         ` Sathyanarayanan Kuppuswamy
@ 2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
  2022-05-04 23:28           ` Kai Huang
  2022-05-05 10:50           ` Kai Huang
  1 sibling, 2 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-04 22:49 UTC (permalink / raw)
  To: Dave Hansen, Kirill A. Shutemov, Kai Huang
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi,

On 5/3/22 3:24 PM, Dave Hansen wrote:
> On 5/2/22 18:27, Kirill A. Shutemov wrote:
>>> Again, Dave and Andi already commented you should use vmap() to avoid breaking
>>> up the direct-mapping.  Please use vmap() instead.
>>>
>>> https://lore.kernel.org/all/ce0feeec-a949-35f8-3010-b0d69acbbc2e@linux.intel.com/
>>>
>>> Will review the rest later.
>> I would rather convert it to use DMA API for memory allocation. It will
>> tap into swiotlb buffer that already converted and there's no need to
>> touch direct mapping. Both allocation and freeing such memory is cheaper
>> because of that.
> 
> Sathya, I don't quite understand why you are so forcefully declining to
> incorporate review feedback on this point.  I gave very specific
> feedback about the kind of mapping you need and that you should avoid
> fragmenting the direct map if at all possible.
> 
> Why is this code still fragmenting the direct map?

Next version will looks like below. It includes vmap support. It also
supports parallel GetQuote reuests by using list to track active
GetQuote request list. Did some inital test. It seems to work fine.

Please take a look at it and let me know your comments.

I will continue to test more cases in parallel.

diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
index 81b59d060392..0ed959dacb7e 100644
--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -13,20 +13,44 @@
  #include <linux/miscdevice.h>
  #include <linux/mm.h>
  #include <linux/io.h>
+#include <linux/set_memory.h>
+#include <linux/spinlock.h>
  #include <asm/tdx.h>
+#include <asm/coco.h>
  #include <uapi/asm/tdx.h>

  #define DRIVER_NAME "tdx-attest"

  /* TDREPORT module call leaf ID */
  #define TDX_GET_REPORT                 4
+/* GetQuote hypercall leaf ID */
+#define TDVMCALL_GET_QUOTE             0x10002

  /* Upper 32 bits has the status code, so mask it */
  #define TDCALL_STATUS_CODE_MASK                0xffffffff00000000
  #define TDCALL_STATUS_CODE(a)          ((a) & TDCALL_STATUS_CODE_MASK)

+/* Used for buffer allocation in GetQuote request */
+struct quote_buf {
+       struct page **pages;
+       int count;
+       void *vmaddr;
+};
+
+/* List entry of quote_list*/
+struct quote_entry {
+       struct quote_buf *buf;
+       struct completion compl;
+       struct list_head list;
+};
+
  static struct miscdevice miscdev;

+/* List to track active GetQuote requests */
+static LIST_HEAD(quote_list);
+/* Lock to protect quote_list */
+static DEFINE_SPINLOCK(quote_lock);
+
  static long tdx_get_report(void __user *argp)
  {
         void *reportdata = NULL, *tdreport = NULL;
@@ -76,6 +100,214 @@ static long tdx_get_report(void __user *argp)
         return ret;
  }

+/* tdx_get_quote_hypercall() - Request to get TD Quote using TDREPORT */
+static long tdx_get_quote_hypercall(struct quote_buf *buf)
+{
+       struct tdx_hypercall_args args = {0};
+
+       args.r10 = TDX_HYPERCALL_STANDARD;
+       args.r11 = TDVMCALL_GET_QUOTE;
+       args.r12 = cc_mkdec(page_to_phys(vmalloc_to_page(buf->vmaddr)));
+       args.r13 = buf->count * PAGE_SIZE;
+
+       /*
+        * Pass the physical address of TDREPORT to the VMM and
+        * trigger the Quote generation. It is not a blocking
+        * call, hence completion of this request will be notified to
+        * the TD guest via a callback interrupt. More info about ABI
+        * can be found in TDX Guest-Host-Communication Interface
+        * (GHCI), sec titled "TDG.VP.VMCALL<GetQuote>".
+        */
+       return __tdx_hypercall(&args, 0);
+}
+
+static struct quote_buf *alloc_quote_buf(u64 req_size)
+{
+       int size = PAGE_ALIGN(req_size);
+       void *addr = NULL, *vmaddr = NULL;
+       int count = size >> PAGE_SHIFT;
+       struct quote_buf *buf;
+       struct page **pages;
+       int i;
+
+       buf = kmalloc(sizeof(*buf), GFP_KERNEL);
+       if (!buf)
+               return NULL;
+
+       /* Allocate mem for array of page ptrs */
+       pages = kcalloc(count, sizeof(*pages), GFP_KERNEL);
+       if (!pages)
+               goto alloc_failed;
+
+       addr = alloc_pages_exact(size, GFP_KERNEL);
+       if (!addr)
+               goto alloc_failed;
+
+       for (i = 0; i < count; i++)
+               pages[i] = virt_to_page(addr + i * PAGE_SIZE);
+
+       vmaddr = vmap(pages, count, VM_MAP, PAGE_KERNEL);
+       if (!vmaddr)
+               goto alloc_failed;
+
+       /* Mark the memory shared to allow VMM access it */
+       if (set_memory_decrypted((unsigned long)vmaddr, count))
+               goto alloc_failed;
+
+       buf->pages = pages;
+       buf->count = count;
+       buf->vmaddr = vmaddr;
+
+       return buf;
+
+alloc_failed:
+       if (vmaddr)
+               vunmap(vmaddr);
+       if (addr)
+               free_pages_exact(addr, size);
+       kfree(pages);
+       kfree(buf);
+       return NULL;
+}
+
+static void free_quote_buf(struct quote_buf *buf)
+{
+       if (!buf)
+               return;
+
+       /* Mark pages private */
+       if (set_memory_encrypted((unsigned long)buf->vmaddr, buf->count)) {
+               pr_warn("Failed to encrypt %d pages at %p", buf->count,
+                               buf->vmaddr);
+               return;
+       }
+
+       vunmap(buf->vmaddr);
+
+       while (buf->count--)
+               __free_page(buf->pages[buf->count]);
+
+       kfree(buf->pages);
+       kfree(buf);
+}
+
+static struct quote_entry *alloc_quote_entry(struct tdx_quote_req 
*quote_req)
+{
+       struct quote_entry *entry = NULL;
+
+       entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+       if (!entry)
+               return NULL;
+
+       /* Allocate buffer for quote request */
+       entry->buf = alloc_quote_buf(quote_req->len);
+       if (entry->buf) {
+               kfree(entry);
+               return NULL;
+       }
+
+       init_completion(&entry->compl);
+
+       return entry;
+}
+
+static void free_quote_entry(struct quote_entry *entry)
+{
+       free_quote_buf(entry->buf);
+       kfree(entry);
+}
+
+void add_quote_entry(struct quote_entry *entry)
+{
+       unsigned long flags;
+
+       spin_lock_irqsave(&quote_lock, flags);
+       list_add_tail(&entry->list, &quote_list);
+       spin_unlock_irqrestore(&quote_lock, flags);
+}
+
+void del_quote_entry(struct quote_entry *entry)
+{
+       unsigned long flags;
+
+       spin_lock_irqsave(&quote_lock, flags);
+       list_del(&entry->list);
+       spin_unlock_irqrestore(&quote_lock, flags);
+}
+
+static long tdx_get_quote(void __user *argp)
+{
+       struct tdx_quote_req quote_req;
+       struct quote_entry *entry;
+       struct quote_buf *buf;
+       long ret = 0;
+
+       /* Copy GetQuote request struct from user buffer */
+       if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req)))
+               return -EFAULT;
+
+       /* Make sure the length is valid */
+       if (!quote_req.len)
+               return -EINVAL;
+
+       entry = alloc_quote_entry(&quote_req);
+       if (!entry)
+               return -ENOMEM;
+
+       buf = entry->buf;
+
+       /* Copy TDREPORT from user buffer to kernel Quote buffer */
+       if (copy_from_user(buf->vmaddr, (void __user *)quote_req.buf,
+                               quote_req.len)) {
+               ret = -EFAULT;
+               goto free_entry;
+       }
+
+       /* Submit GetQuote Request */
+       ret = tdx_get_quote_hypercall(buf);
+       if (ret) {
+               pr_err("GetQuote hypercall failed, status:%lx\n", ret);
+               ret = -EIO;
+               goto free_entry;
+       }
+
+       /* Add current quote entry to quote_list */
+       add_quote_entry(entry);
+
+       /* Wait for attestation completion */
+       ret = wait_for_completion_interruptible(&entry->compl);
+       if (ret < 0) {
+               ret = -EIO;
+               goto del_entry;
+       }
+
+       /* Copy output data back to user buffer */
+       if (copy_to_user((void __user *)quote_req.buf, buf->vmaddr, 
quote_req.len))
+               ret = -EFAULT;
+
+del_entry:
+       del_quote_entry(entry);
+free_entry:
+       free_quote_entry(entry);
+       return ret;
+}
+
+static void attestation_callback_handler(void)
+{
+       struct tdx_quote_hdr *quote_hdr;
+       struct quote_entry *entry;
+
+       /* Find processed quote request and mark it complete */
+       spin_lock(&quote_lock);
+       list_for_each_entry(entry, &quote_list, list) {
+               quote_hdr = (struct tdx_quote_hdr *)entry->buf->vmaddr;
+               /* If status is either success or failure, mark it 
complete */
+               if (quote_hdr->status != GET_QUOTE_IN_FLIGHT)
+                       complete(&entry->compl);
+       }
+       spin_unlock(&quote_lock);
+}
+
  static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
                              unsigned long arg)
  {
@@ -86,6 +318,9 @@ static long tdx_attest_ioctl(struct file *file, 
unsigned int cmd,
         case TDX_CMD_GET_REPORT:
                 ret = tdx_get_report(argp);
                 break;
+       case TDX_CMD_GET_QUOTE:
+               ret = tdx_get_quote(argp);
+               break;
         default:
                 pr_debug("cmd %d not supported\n", cmd);
                 break;
@@ -108,6 +343,9 @@ static int __init tdx_attestation_init(void)
         if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
                 return -EIO;

+       /* Register attestation event notify handler */
+       tdx_setup_ev_notify_handler(attestation_callback_handler);
+
         miscdev.name = DRIVER_NAME;
         miscdev.minor = MISC_DYNAMIC_MINOR;
         miscdev.fops = &tdx_attest_fops;
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index b49211994864..ab6c94cd880d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -15,6 +15,7 @@
  #include <asm/idtentry.h>
  #include <asm/irq_regs.h>
  #include <asm/desc.h>
+#include <asm/io.h>

  /* TDX module Call Leaf IDs */
  #define TDX_GET_INFO                   1
@@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start, 
unsigned long len,
   */
  static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, 
bool enc)
  {
-       phys_addr_t start = __pa(vaddr);
-       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
+       phys_addr_t start;
+       phys_addr_t end;
+
+       if (is_vmalloc_addr((void *)vaddr))
+               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
+       else
+               start = __pa(vaddr);
+
+       end = start + numpages * PAGE_SIZE;

         if (!enc) {
                 /* Set the shared (decrypted) bits: */
diff --git a/arch/x86/include/uapi/asm/tdx.h 
b/arch/x86/include/uapi/asm/tdx.h
index 5b721e0ebbb8..3fb69fe3ffb7 100644
--- a/arch/x86/include/uapi/asm/tdx.h
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -33,4 +33,43 @@ struct tdx_report_req {
  /* Get TDREPORT using TDCALL[TDG.MR.REPORT) */
  #define TDX_CMD_GET_REPORT             _IOWR('T', 0x01, struct 
tdx_report_req)

+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf                : Pass user data that includes TDREPORT as 
input. Upon
+ *               successful completion of IOCTL, output is copied
+ *               back to the same buffer.
+ * @len                : Length of the buffer.
+ */
+struct tdx_quote_req {
+       __u64 buf;
+       __u64 len;
+};
+
+/* Get TD Quote from QE/QGS using GetQuote TDVMCALL */
+#define TDX_CMD_GET_QUOTE              _IOR('T', 0x02, struct 
tdx_quote_req)
+
+/* TD Quote status codes */
+#define GET_QUOTE_SUCCESS              0
+#define GET_QUOTE_IN_FLIGHT            0xffffffffffffffff
+#define GET_QUOTE_ERROR                        0x8000000000000000
+#define GET_QUOTE_SERVICE_UNAVAILABLE  0x8000000000000001
+
+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+       /* Quote version, filled by TD */
+       __u64 version;
+       /* Status code of Quote request, filled by VMM */
+       __u64 status;
+       /* Length of TDREPORT, filled by TD */
+       __u32 in_len;
+       /* Length of Quote, filled by VMM */
+       __u32 out_len;
+       /* Actual Quote data */
+       __u64 data[0];
+};
+
  #endif /* _UAPI_ASM_X86_TDX_H */


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
@ 2022-05-04 23:28           ` Kai Huang
  2022-05-05 20:53             ` Sathyanarayanan Kuppuswamy
  2022-05-05 10:50           ` Kai Huang
  1 sibling, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-04 23:28 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -15,6 +15,7 @@
>   #include <asm/idtentry.h>
>   #include <asm/irq_regs.h>
>   #include <asm/desc.h>
> +#include <asm/io.h>
> 
>   /* TDX module Call Leaf IDs */
>   #define TDX_GET_INFO                   1
> @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start, 
> unsigned long len,
>    */
>   static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, 
> bool enc)
>   {
> -       phys_addr_t start = __pa(vaddr);
> -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> +       phys_addr_t start;
> +       phys_addr_t end;
> +
> +       if (is_vmalloc_addr((void *)vaddr))
> +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
> +       else
> +               start = __pa(vaddr);
> +
> +       end = start + numpages * PAGE_SIZE;
> 
>          if (!enc) {
>                  /* Set the shared (decrypted) bits: */

Looks set_memory_decrypted() only works for direct-mapping, so you should not
use this.  Instead, you can pass shared bit in 'prot' argument (using
pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
  2022-05-04 23:28           ` Kai Huang
@ 2022-05-05 10:50           ` Kai Huang
  2022-05-05 19:03             ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-05 10:50 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel


> +       /* Submit GetQuote Request */
> +       ret = tdx_get_quote_hypercall(buf);
> +       if (ret) {
> +               pr_err("GetQuote hypercall failed, status:%lx\n", ret);
> +               ret = -EIO;
> +               goto free_entry;
> +       }
> +
> +       /* Add current quote entry to quote_list */
> +       add_quote_entry(entry);
> +
> +       /* Wait for attestation completion */
> +       ret = wait_for_completion_interruptible(&entry->compl);
> +       if (ret < 0) {
> +               ret = -EIO;
> +               goto del_entry;
> +       }

This is misuse of wait_for_completion_interruptible(). 

xxx_interruptible() essentially means this operation can be interrupted by
signal.  Using xxx_interruptible() in driver IOCTL essentially means when it
returns due to signal, the IOCTL should return -EINTR to let userspace know that
your application received some signal needs handling, and this IOCTL isn't
finished and you should retry.  So here we should return -EINTR (and cleanup all
staff has been done) when wait_for_completion_interruptible() returns -
ERESTARTSYS (in fact, it returns only -ERESTARTSYS or 0).

Since normally userspace application just ignore signals, and in this particular
case, asking userspace to retry just makes things more complicated to handle, I
think you can just use wait_for_completion_killable(), which only returns when
the application receives signal that it is going to be killed.

> +
> +       /* Copy output data back to user buffer */
> +       if (copy_to_user((void __user *)quote_req.buf, buf->vmaddr, 
> quote_req.len))
> +               ret = -EFAULT;
> +
> +del_entry:
> +       del_quote_entry(entry);
> +free_entry:
> +       free_quote_entry(entry);

As I (and Isaku) mentioned before, when wait_for_completion_killable() returns
with error, you cannot just convert the buffer to private and free it.  The VMM
is still owning it (IN_FLIGHT).

One way to handle is you can put those buffers that are still owned by VMM to a
new list, and have some kernel thread to periodically check buffer's status and
free those are already released by VMM.  I haven't thought thoroughly, so maybe
there's better way to handle, though.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 10:50           ` Kai Huang
@ 2022-05-05 19:03             ` Sathyanarayanan Kuppuswamy
  2022-05-05 22:25               ` Kai Huang
  0 siblings, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-05 19:03 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Kai,

On 5/5/22 3:50 AM, Kai Huang wrote:
> 
>> +       /* Submit GetQuote Request */
>> +       ret = tdx_get_quote_hypercall(buf);
>> +       if (ret) {
>> +               pr_err("GetQuote hypercall failed, status:%lx\n", ret);
>> +               ret = -EIO;
>> +               goto free_entry;
>> +       }
>> +
>> +       /* Add current quote entry to quote_list */
>> +       add_quote_entry(entry);
>> +
>> +       /* Wait for attestation completion */
>> +       ret = wait_for_completion_interruptible(&entry->compl);
>> +       if (ret < 0) {
>> +               ret = -EIO;
>> +               goto del_entry;
>> +       }
> 
> This is misuse of wait_for_completion_interruptible().
> 
> xxx_interruptible() essentially means this operation can be interrupted by
> signal.  Using xxx_interruptible() in driver IOCTL essentially means when it
> returns due to signal, the IOCTL should return -EINTR to let userspace know that
> your application received some signal needs handling, and this IOCTL isn't
> finished and you should retry.  So here we should return -EINTR (and cleanup all
> staff has been done) when wait_for_completion_interruptible() returns -
> ERESTARTSYS (in fact, it returns only -ERESTARTSYS or 0).


But in this case, I was expecting the user agent to check the Quote
buffer status code to understand the IN_FLIGHT, SUCCESS or FAILURE
status and handle it appropriately. So IMO, it should not matter what
we return for the failure case. For the IN_FLIGHT case, they can retry
if they want after checking the status code.

But I agree that EINTR is the appropriate return value for an
interrupted case. So, I will change it.

> 
> Since normally userspace application just ignore signals, and in this particular
> case, asking userspace to retry just makes things more complicated to handle, I

I am not sure how the user agent is going to be implemented. So I don't
want to make any assumptions. In this case, we are not asking user space
to implement the retry support using signals. But we are just giving 
them option to do it. It is up to them if they want to use it.

> think you can just use wait_for_completion_killable(), which only returns when
> the application receives signal that it is going to be killed.

If you agree with the above point, we can leave just it as 
*_interruptible(). But if you still see other issues, please let me
know.

> 
>> +
>> +       /* Copy output data back to user buffer */
>> +       if (copy_to_user((void __user *)quote_req.buf, buf->vmaddr,
>> quote_req.len))
>> +               ret = -EFAULT;
>> +
>> +del_entry:
>> +       del_quote_entry(entry);
>> +free_entry:
>> +       free_quote_entry(entry);
> 
> As I (and Isaku) mentioned before, when wait_for_completion_killable() returns
> with error, you cannot just convert the buffer to private and free it.  The VMM
> is still owning it (IN_FLIGHT).

Do you know what happens when VMM writes to a page which already marked
private? Will MapGPA fail during shared-private conversion?

> 
> One way to handle is you can put those buffers that are still owned by VMM to a
> new list, and have some kernel thread to periodically check buffer's status and
> free those are already released by VMM.  I haven't thought thoroughly, so maybe
> there's better way to handle, though.

Instead of adding new thread to just handle the cleanup, maybe I
can move the entire callback interrupt logic (contents of
attestation_callback_handler()) to a work queue and wake up this
work queue whenever we get the callback interrupt.

We can let the same work queue handle the cleanup for interrupted
requests. As for as how to identify the interrupted request, we
can add a bit in queue_entry for it and set it when we exit
wait_for_completion*() due to signals.

I will do a sample logic and get back to you.



> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-04 23:28           ` Kai Huang
@ 2022-05-05 20:53             ` Sathyanarayanan Kuppuswamy
  2022-05-05 22:15               ` Kai Huang
  0 siblings, 1 reply; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-05 20:53 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

Hi Kai,

On 5/4/22 4:28 PM, Kai Huang wrote:
> On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
>> --- a/arch/x86/coco/tdx/tdx.c
>> +++ b/arch/x86/coco/tdx/tdx.c
>> @@ -15,6 +15,7 @@
>>    #include <asm/idtentry.h>
>>    #include <asm/irq_regs.h>
>>    #include <asm/desc.h>
>> +#include <asm/io.h>
>>
>>    /* TDX module Call Leaf IDs */
>>    #define TDX_GET_INFO                   1
>> @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start,
>> unsigned long len,
>>     */
>>    static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
>> bool enc)
>>    {
>> -       phys_addr_t start = __pa(vaddr);
>> -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
>> +       phys_addr_t start;
>> +       phys_addr_t end;
>> +
>> +       if (is_vmalloc_addr((void *)vaddr))
>> +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
>> +       else
>> +               start = __pa(vaddr);
>> +
>> +       end = start + numpages * PAGE_SIZE;
>>
>>           if (!enc) {
>>                   /* Set the shared (decrypted) bits: */
> 
> Looks set_memory_decrypted() only works for direct-mapping, so you should not
> use this.  Instead, you can pass shared bit in 'prot' argument (using
> pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().

Is it because of the above change, or you see other direct-mapping 
dependencies in set_memory_*() functions?


> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 20:53             ` Sathyanarayanan Kuppuswamy
@ 2022-05-05 22:15               ` Kai Huang
  2022-05-05 22:38                 ` Sathyanarayanan Kuppuswamy
  2022-05-05 23:06                 ` Dave Hansen
  0 siblings, 2 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-05 22:15 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-05-05 at 13:53 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai,
> 
> On 5/4/22 4:28 PM, Kai Huang wrote:
> > On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
> > > --- a/arch/x86/coco/tdx/tdx.c
> > > +++ b/arch/x86/coco/tdx/tdx.c
> > > @@ -15,6 +15,7 @@
> > >    #include <asm/idtentry.h>
> > >    #include <asm/irq_regs.h>
> > >    #include <asm/desc.h>
> > > +#include <asm/io.h>
> > > 
> > >    /* TDX module Call Leaf IDs */
> > >    #define TDX_GET_INFO                   1
> > > @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start,
> > > unsigned long len,
> > >     */
> > >    static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
> > > bool enc)
> > >    {
> > > -       phys_addr_t start = __pa(vaddr);
> > > -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> > > +       phys_addr_t start;
> > > +       phys_addr_t end;
> > > +
> > > +       if (is_vmalloc_addr((void *)vaddr))
> > > +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
> > > +       else
> > > +               start = __pa(vaddr);
> > > +
> > > +       end = start + numpages * PAGE_SIZE;
> > > 
> > >           if (!enc) {
> > >                   /* Set the shared (decrypted) bits: */
> > 
> > Looks set_memory_decrypted() only works for direct-mapping, so you should not
> > use this.  Instead, you can pass shared bit in 'prot' argument (using
> > pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().
> 
> Is it because of the above change, or you see other direct-mapping 
> dependencies in set_memory_*() functions?
> 
> 
> 
set_memory_xx()  is supposedly only for direct-mapping.  Please use my
suggestion above.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 19:03             ` Sathyanarayanan Kuppuswamy
@ 2022-05-05 22:25               ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-05 22:25 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-05-05 at 12:03 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai,
> 
> On 5/5/22 3:50 AM, Kai Huang wrote:
> > 
> > > +       /* Submit GetQuote Request */
> > > +       ret = tdx_get_quote_hypercall(buf);
> > > +       if (ret) {
> > > +               pr_err("GetQuote hypercall failed, status:%lx\n", ret);
> > > +               ret = -EIO;
> > > +               goto free_entry;
> > > +       }
> > > +
> > > +       /* Add current quote entry to quote_list */
> > > +       add_quote_entry(entry);
> > > +
> > > +       /* Wait for attestation completion */
> > > +       ret = wait_for_completion_interruptible(&entry->compl);
> > > +       if (ret < 0) {
> > > +               ret = -EIO;
> > > +               goto del_entry;
> > > +       }
> > 
> > This is misuse of wait_for_completion_interruptible().
> > 
> > xxx_interruptible() essentially means this operation can be interrupted by
> > signal.  Using xxx_interruptible() in driver IOCTL essentially means when it
> > returns due to signal, the IOCTL should return -EINTR to let userspace know that
> > your application received some signal needs handling, and this IOCTL isn't
> > finished and you should retry.  So here we should return -EINTR (and cleanup all
> > staff has been done) when wait_for_completion_interruptible() returns -
> > ERESTARTSYS (in fact, it returns only -ERESTARTSYS or 0).
> 
> 
> But in this case, I was expecting the user agent to check the Quote
> buffer status code to understand the IN_FLIGHT, SUCCESS or FAILURE
> status and handle it appropriately. So IMO, it should not matter what
> we return for the failure case. For the IN_FLIGHT case, they can retry
> if they want after checking the status code.

Couple of issues around your statement:

1) When wait_for_completion_interruptible() returns error, you never copied the
data back to userspace.  Therefore userspace cannot check buffer status.  Your
assumption is wrong.
2) Even if you copy the data back to userspace, there's no guarantee *after* you
do the copy, the VMM won't update the buffer status.  So this doesn't work.
3) You only provide one IOCTL.  Even if userspace can retry, you will create a
new buffer, and ask VMM to do it again.  Then what happens to the old buffer? 
VMM is still owning it, and can update it, but you have already converted it
back to private, and freed it.

I really don't see how your statement is correct.

> 
> But I agree that EINTR is the appropriate return value for an
> interrupted case. So, I will change it.
> 
> > 
> > Since normally userspace application just ignore signals, and in this particular
> > case, asking userspace to retry just makes things more complicated to handle, I
> 
> I am not sure how the user agent is going to be implemented. So I don't
> want to make any assumptions. In this case, we are not asking user space
> to implement the retry support using signals. But we are just giving 
> them option to do it. It is up to them if they want to use it.

As I see you are not providing any functionality to allow userspace to retry.



-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 22:15               ` Kai Huang
@ 2022-05-05 22:38                 ` Sathyanarayanan Kuppuswamy
  2022-05-05 23:06                 ` Dave Hansen
  1 sibling, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-05 22:38 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/5/22 3:15 PM, Kai Huang wrote:
> On Thu, 2022-05-05 at 13:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi Kai,
>>
>> On 5/4/22 4:28 PM, Kai Huang wrote:
>>> On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
>>>> --- a/arch/x86/coco/tdx/tdx.c
>>>> +++ b/arch/x86/coco/tdx/tdx.c
>>>> @@ -15,6 +15,7 @@
>>>>     #include <asm/idtentry.h>
>>>>     #include <asm/irq_regs.h>
>>>>     #include <asm/desc.h>
>>>> +#include <asm/io.h>
>>>>
>>>>     /* TDX module Call Leaf IDs */
>>>>     #define TDX_GET_INFO                   1
>>>> @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start,
>>>> unsigned long len,
>>>>      */
>>>>     static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
>>>> bool enc)
>>>>     {
>>>> -       phys_addr_t start = __pa(vaddr);
>>>> -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
>>>> +       phys_addr_t start;
>>>> +       phys_addr_t end;
>>>> +
>>>> +       if (is_vmalloc_addr((void *)vaddr))
>>>> +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
>>>> +       else
>>>> +               start = __pa(vaddr);
>>>> +
>>>> +       end = start + numpages * PAGE_SIZE;
>>>>
>>>>            if (!enc) {
>>>>                    /* Set the shared (decrypted) bits: */
>>>
>>> Looks set_memory_decrypted() only works for direct-mapping, so you should not
>>> use this.  Instead, you can pass shared bit in 'prot' argument (using
>>> pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().
>>
>> Is it because of the above change, or you see other direct-mapping
>> dependencies in set_memory_*() functions?
>>
>>
>>
> set_memory_xx()  is supposedly only for direct-mapping.  Please use my
> suggestion above.

I did not find any other direct-mapping dependency in set_memory_*()
functions other than what I have fixed. If I missed anything, please
let me know.

Also, even if set_memory_*() functions does not support vmalloc'ed
memory,  IMO, it is better to add this support to it.

I want to avoid custom solution if it is possible to use generic
function.

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 22:15               ` Kai Huang
  2022-05-05 22:38                 ` Sathyanarayanan Kuppuswamy
@ 2022-05-05 23:06                 ` Dave Hansen
  2022-05-06  0:11                   ` Kai Huang
  2022-05-06 11:00                   ` Kai Huang
  1 sibling, 2 replies; 56+ messages in thread
From: Dave Hansen @ 2022-05-05 23:06 UTC (permalink / raw)
  To: Kai Huang, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 5/5/22 15:15, Kai Huang wrote:
> set_memory_xx()  is supposedly only for direct-mapping.  Please use my
> suggestion above.

Kai, please take a look at some of the other users, especially
set_memory_x().  See how long the "supposed" requirement holds up.

That said, I've forgotten by now if this _could_ have used vmalloc() or
vmap() or vmap_pfn().  None of the logic about why or how the allocator
and mapping design decisions were made.  Could that be be rectified for
the next post?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 23:06                 ` Dave Hansen
@ 2022-05-06  0:11                   ` Kai Huang
  2022-05-06  1:55                     ` Sathyanarayanan Kuppuswamy
  2022-05-07  0:42                     ` Kirill A. Shutemov
  2022-05-06 11:00                   ` Kai Huang
  1 sibling, 2 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-06  0:11 UTC (permalink / raw)
  To: Dave Hansen, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
> On 5/5/22 15:15, Kai Huang wrote:
> > set_memory_xx()  is supposedly only for direct-mapping.  Please use my
> > suggestion above.
> 
> Kai, please take a look at some of the other users, especially
> set_memory_x().  See how long the "supposed" requirement holds up.

Right I should not have used "supposed".  My bad.  I got the impression by
roughly looking at set_memory_{uc|wc..}().  Looks they can only work on direct
mapaping as they internally uses __pa():

int set_memory_wc(unsigned long addr, int numpages)                            
{
        int ret;                                                               

        ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,   
                _PAGE_CACHE_MODE_WC, NULL);                                    
        if (ret)
                return ret;                                                    

        ret = _set_memory_wc(addr, numpages);                                  
        if (ret)
                memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);   

        return ret;                                                            
}

Don't all set_memory_xxx() functions have the same schematic?

> 
> That said, I've forgotten by now if this _could_ have used vmalloc() or
> vmap() or vmap_pfn().  None of the logic about why or how the allocator
> and mapping design decisions were made.  Could that be be rectified for
> the next post?

Looking at set_memory_{encrypted|decrypted}() again, it seems they currently
only works on direct mapping for TDX (as sathya's code has showed).  For AMD it
appears they can work on any virtual address as AMD uses lookup_address() to
find the PFN. 

So if the two are supposed to work on any virtual address, then it makes sense
to fix at TDX side.

Btw, regarding to my suggestion of using vmap() with prot_decrypted() +
MapGPA(), after thinking again, I think there is also a problem -- the TLB for
private mapping and the cache are not flushed.  So looks we should fix
set_memory_decrypted() to work on any virtual address and use it for vmap().

Back to the "why and how the allocator and mapping design decisions were made",
let me summarize options and my preference below:

1) Using DMA API.  This guarantees for TDX1.0 the allocated buffer is shared
(set_memory_decrypted() is called for swiotlb).  But this may not guarantee the
buffer is shared in future generation of TDX.  This of course depends on how we
are going to change those DMA API implementations for future TDX but
conceptually using DMA API is more like for convenience purpose.  Also, in order
to use DMA API, we need more code to handle additional 'platform device' which
is not mandatory for attestation driver.

2) Using vmap() + set_memory_decrypted().  This requires to change the latter to
support any virtual address for TDX.  But now I guess it's the right way since
it's better to have some infrastructure to convert private page to shared
besides DMA API anyway.

3) Using vmap() + MapGPA().  This requires additional work on TLB flush and
cache flush.  Now I think we should not use this.

Given above, I personally think 2) is better.

Kirill, what's your opinion?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-06  0:11                   ` Kai Huang
@ 2022-05-06  1:55                     ` Sathyanarayanan Kuppuswamy
  2022-05-07  0:42                     ` Kirill A. Shutemov
  1 sibling, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-06  1:55 UTC (permalink / raw)
  To: Kai Huang, Dave Hansen, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel



On 5/5/22 5:11 PM, Kai Huang wrote:
> On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
>> On 5/5/22 15:15, Kai Huang wrote:
>>> set_memory_xx()  is supposedly only for direct-mapping.  Please use my
>>> suggestion above.
>>
>> Kai, please take a look at some of the other users, especially
>> set_memory_x().  See how long the "supposed" requirement holds up.
> 
> Right I should not have used "supposed".  My bad.  I got the impression by
> roughly looking at set_memory_{uc|wc..}().  Looks they can only work on direct
> mapaping as they internally uses __pa():
> 
> int set_memory_wc(unsigned long addr, int numpages)
> {
>          int ret;
> 
>          ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
>                  _PAGE_CACHE_MODE_WC, NULL);
>          if (ret)
>                  return ret;
> 
>          ret = _set_memory_wc(addr, numpages);
>          if (ret)
>                  memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);
> 
>          return ret;
> }
> 
> Don't all set_memory_xxx() functions have the same schematic?
> 
>>
>> That said, I've forgotten by now if this _could_ have used vmalloc() or
>> vmap() or vmap_pfn().  None of the logic about why or how the allocator
>> and mapping design decisions were made.  Could that be be rectified for
>> the next post?

In addition to Kai's reply, a few more points about where this memory 
being used, and your comment history is listed below:

This memory is used by VMM to copy the TD Quote data after completing
the Quote generation request from the TD guest. It requires a physically
contiguous shared buffer of given length, which is passed to VMM using
GetQuote hypercall.

Initially, we have allocated this memory using alloc_pages* (or some
variant of it) and directly called set_memory_{de/en}crypted() to share
/unshare these pages. For the above-mentioned approach, you have
suggested to use vmap to not affect the direct mapping.

Regarding vmalloc(), we cannot use it because we need physically
contiguous space.

Regarding history about using DMA APIs vs VMAP approach is already
explained by Kai below.

I will add relevant details to the commit log in next patch submission.

> 
> Looking at set_memory_{encrypted|decrypted}() again, it seems they currently
> only works on direct mapping for TDX (as sathya's code has showed).  For AMD it
> appears they can work on any virtual address as AMD uses lookup_address() to
> find the PFN.
> 
> So if the two are supposed to work on any virtual address, then it makes sense
> to fix at TDX side.
> 
> Btw, regarding to my suggestion of using vmap() with prot_decrypted() +
> MapGPA(), after thinking again, I think there is also a problem -- the TLB for
> private mapping and the cache are not flushed.  So looks we should fix
> set_memory_decrypted() to work on any virtual address and use it for vmap().
> 
> Back to the "why and how the allocator and mapping design decisions were made",
> let me summarize options and my preference below:
> 
> 1) Using DMA API.  This guarantees for TDX1.0 the allocated buffer is shared
> (set_memory_decrypted() is called for swiotlb).  But this may not guarantee the
> buffer is shared in future generation of TDX.  This of course depends on how we
> are going to change those DMA API implementations for future TDX but
> conceptually using DMA API is more like for convenience purpose.  Also, in order
> to use DMA API, we need more code to handle additional 'platform device' which
> is not mandatory for attestation driver.
> 
> 2) Using vmap() + set_memory_decrypted().  This requires to change the latter to
> support any virtual address for TDX.  But now I guess it's the right way since
> it's better to have some infrastructure to convert private page to shared
> besides DMA API anyway.
> 
> 3) Using vmap() + MapGPA().  This requires additional work on TLB flush and
> cache flush.  Now I think we should not use this.
> 
> Given above, I personally think 2) is better.
> 
> Kirill, what's your opinion?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-05 23:06                 ` Dave Hansen
  2022-05-06  0:11                   ` Kai Huang
@ 2022-05-06 11:00                   ` Kai Huang
  2022-05-06 15:47                     ` Dave Hansen
  2022-05-07  1:00                     ` Kirill A. Shutemov
  1 sibling, 2 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-06 11:00 UTC (permalink / raw)
  To: Dave Hansen, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
> On 5/5/22 15:15, Kai Huang wrote:
> > set_memory_xx()  is supposedly only for direct-mapping.  Please use my
> > suggestion above.
> 
> Kai, please take a look at some of the other users, especially
> set_memory_x().  See how long the "supposed" requirement holds up.
> 
> That said, I've forgotten by now if this _could_ have used vmalloc() or
> vmap() or vmap_pfn().  None of the logic about why or how the allocator
> and mapping design decisions were made.  Could that be be rectified for
> the next post?

Hi Dave,

(Sorry previous reply was too fast..)

I spent some time looking into how __change_page_attr_set_clr() is implemented,
which is called by all set_memory_xxx() variants.  If my understanding is
correct, __change_page_attr_set_clr() will work for vmap() variants, because it
internally uses lookup_address(), which walks the page table directly, to find
the actual PTE (and PFN).  So set_memory_decrypted() can be fixed to support
vmap() mapping for TDX.

However, looking at the code, set_memory_decrypted() calls
__change_page_attr_set_clr(&cpa, 1).  The second argument is 'checkalias', which
means even we call set_memory_decrypted() against vmap() address, the aliasing
mappings will be changed too.  And if I understand correctly, the aliasing
mapping includes direct-mapping too:

static int cpa_process_alias(struct cpa_data *cpa)                             
{                                                                              
        struct cpa_data alias_cpa;
        unsigned long laddr = (unsigned long)__va(cpa->pfn << PAGE_SHIFT);     
        unsigned long vaddr;
        int ret;                                                               
                                                                                                                                                   
        if (!pfn_range_is_mapped(cpa->pfn, cpa->pfn + 1))                      
                return 0;                                                      

        /*
         * No need to redo, when the primary call touched the direct           
         * mapping already:                                                    
         */
        vaddr = __cpa_addr(cpa, cpa->curpage);                                 
        if (!(within(vaddr, PAGE_OFFSET,
                    PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)))) {
                
                alias_cpa = *cpa; 
                alias_cpa.vaddr = &laddr;
                alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);             
                alias_cpa.curpage = 0;                                         
                
                cpa->force_flush_all = 1;                                      

                ret = __change_page_attr_set_clr(&alias_cpa, 0);               
                if (ret)                                                       
                        return ret;
        }                                                                      

#ifdef CONFIG_X86_64                                                           
        /*
         * If the primary call didn't touch the high mapping already           
         * and the physical address is inside the kernel map, we need          
         * to touch the high mapped kernel as well:
         */                                                                    
        if (!within(vaddr, (unsigned long)_text, _brk_end) &&                  
            __cpa_pfn_in_highmap(cpa->pfn)) {                                  
                unsigned long temp_cpa_vaddr = (cpa->pfn << PAGE_SHIFT) +      
                                               __START_KERNEL_map - phys_base; 
                alias_cpa = *cpa;                                              
                alias_cpa.vaddr = &temp_cpa_vaddr;                             
                alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);
                alias_cpa.curpage = 0;                                         

                cpa->force_flush_all = 1;
                /*
                 * The high mapping range is imprecise, so ignore the
                 * return value.
                 */
                __change_page_attr_set_clr(&alias_cpa, 0);
        }
#endif

        return 0;
}

As you can see, the first chunk checks whether the virtual address is direct-
mapping, and if it is not, direct mapping is changed too.

The second chunk even changes the high kernel mapping.

So, if we use set_memory_decrypted(), there's no difference whether the address
is vmap() or direct mapping address.  The direct mapping will be changed anyway.

(However, it seems if we call set_memory_decrypted() against direct mapping
address, the vmap() mapping won't be impacted, because it seems
cpa_process_alias() doesn't check vmap() area..).

However I may have missed something.  Kirill please help to confirm if you see
this.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-06 11:00                   ` Kai Huang
@ 2022-05-06 15:47                     ` Dave Hansen
  2022-05-07  1:00                     ` Kirill A. Shutemov
  1 sibling, 0 replies; 56+ messages in thread
From: Dave Hansen @ 2022-05-06 15:47 UTC (permalink / raw)
  To: Kai Huang, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Kirill A . Shutemov, Tony Luck, Andi Kleen,
	Wander Lairson Costa, Isaku Yamahata, marcelo.cerri, tim.gardner,
	khalid.elmously, philip.cox, linux-kernel

On 5/6/22 04:00, Kai Huang wrote:
> However I may have missed something.  Kirill please help to confirm if you see
> this.

Kai, thanks for taking a more thorough look through this.

No matter what mechanism is used here, I'd like to see some commit
message material about the testing that was performed, including
experimental confirmation that it doesn't affect the direct map.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-06  0:11                   ` Kai Huang
  2022-05-06  1:55                     ` Sathyanarayanan Kuppuswamy
@ 2022-05-07  0:42                     ` Kirill A. Shutemov
  2022-05-09  3:37                       ` Kai Huang
  1 sibling, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-07  0:42 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> Kirill, what's your opinion?

I said before that I think DMA API is the right tool here.

Speculation about future of DMA in TDX is irrelevant here. If semantics
change we will need to re-evaluate all users. VirtIO uses DMA API and it
is conceptually the same use-case: communicate with the host.

But vmap() + set_memory_decrypted() also works and Sathya already has code
for it. I'm fine with this.

Going a step below to manual MapGPA() is just wrong. We introduced
abstructions for a reason. Protocol of changing GPA status is not trivial.
We should not spread it across all kernel codebase.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-06 11:00                   ` Kai Huang
  2022-05-06 15:47                     ` Dave Hansen
@ 2022-05-07  1:00                     ` Kirill A. Shutemov
  1 sibling, 0 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-07  1:00 UTC (permalink / raw)
  To: Kai Huang
  Cc: Dave Hansen, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Fri, May 06, 2022 at 11:00:29PM +1200, Kai Huang wrote:
> However I may have missed something.  Kirill please help to confirm if you see
> this.

That's correct. set_memory_decrypted() will change direct mapping as well
even if called on vmap(). I should have catched it before. Sorry.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-07  0:42                     ` Kirill A. Shutemov
@ 2022-05-09  3:37                       ` Kai Huang
  2022-05-09 12:09                         ` Kirill A. Shutemov
  0 siblings, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-09  3:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Sathyanarayanan Kuppuswamy, Kirill A. Shutemov,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > Kirill, what's your opinion?
> 
> I said before that I think DMA API is the right tool here.
> 
> Speculation about future of DMA in TDX is irrelevant here. If semantics
> change we will need to re-evaluate all users. VirtIO uses DMA API and it
> is conceptually the same use-case: communicate with the host.

Virtio is designed for device driver to use, so it's fine to use DMA API. And
real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
assumption.

DMA API has it's limitations.  I don't see the protocol to convert GPA from
private to shared is so complicated (see below), and I think regardless this
attestation use case, it's perhaps worth to provide an additional simple
infrastructure for such case so it can be used when needed.

> 
> But vmap() + set_memory_decrypted() also works and Sathya already has code
> for it. I'm fine with this.

Dave said (again) he wanted to avoid breaking up direct mapping.

https://lore.kernel.org/lkml/5d34ac93-09dc-ea93-bffe-f3995647cd5b@linux.intel.com/T/#m37778b8af5d72c3db79e3cfa4b46ee37836f771c

So we need to use seither use DMA API (which already breaks direct-mapping for
swiotlb), or we use vmap() + MapGPA() as I mentioned below.

> 
> Going a step below to manual MapGPA() is just wrong. We introduced
> abstructions for a reason. Protocol of changing GPA status is not trivial.
> We should not spread it across all kernel codebase.
> 

I kinda disagree with this.  In fact, the protocol of changing GPA status isn't
that complicated.  TD guest can have both private and shared mappings to the
same GPA simultaneously.  We don't need to change all the mappings when
converting private page to shared.

For instance, we can use vmap() to have a shared mapping to a page, while the
page is still mapped as private in direct-mapping.  TD uses MapGPA() to tell VMM
which mapping it wants to use, and it just needs to guarantee that the private
(direct) mapping won't be used.  Speculative fetch using the direct mapping is
fine, as long as the core won't consume the data.  The only thing we need to
guarantee is we need to flush any dirty cachelines before MapGPA(). My
understanding is we don't even need to flush the TLB of the private mapping.

And reading the GHCI MapGPA() again, to me MapGPA() itself requires VMM to
guarantee the TLB and cache flush:

"
If the GPA (range) was already mapped as an active, private page, the host VMM
may remove the private page from the TD by following the “Removing TD Private
Pages” sequence in the Intel TDX-module specification [3] to safely block the
mapping(s), flush the TLB and cache, and remove the mapping(s). The VMM is
designed to be able to then map the specified GPA (range) in the shared-EPT
structure and allow the TD to access the page(s) as a shared GPA (range).
"

You can see the cache flush is guaranteed by VMM.

Btw, the use of word "may" in "host VMM may remove..." in above paragraph is
horrible.  It should use "must", just like to the "converting shared to private"
case:

"
If the Start GPA specified is a private GPA (GPA.S bit is clear), this MapGPA
TDG.VP.VMCALL can be used to help request the host VMM map the specific, private
page(s) (which mapping may involve converting the backing-physical page from a
shared page to a private page). As intended in this case, the VMM must unmap the
GPA from the shared-EPT region and invalidate the TLB and caches for the TD
vcpus to help ensure no stale mappings and cache contents exist.
"

As you can see "must" is used in "the VMM must unmap the GPA from the shared-EPT
...".

So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
page (or couple of pages) as shared on-demand, like below:

	page = alloc_page();

	addr = vmap(page,  pgprot_decrypted(PAGE_KERNEL));

	clflush_cache_range(page_address(page), PAGE_SIZE);

	MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);

And we can even avoid above clflush_cache_range() if I understand correctly.

Or  I missed something?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09  3:37                       ` Kai Huang
@ 2022-05-09 12:09                         ` Kirill A. Shutemov
  2022-05-09 14:14                           ` Dave Hansen
  2022-05-09 23:54                           ` Kai Huang
  0 siblings, 2 replies; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-09 12:09 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kirill A. Shutemov, Dave Hansen, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > Kirill, what's your opinion?
> > 
> > I said before that I think DMA API is the right tool here.
> > 
> > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > is conceptually the same use-case: communicate with the host.
> 
> Virtio is designed for device driver to use, so it's fine to use DMA API. And
> real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
> assumption.

Whether attestation driver uses struct device is implementation detail.
I don't see what is you point.

> DMA API has it's limitations.

Could you elaborate here?


> So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
> page (or couple of pages) as shared on-demand, like below:
> 
> 	page = alloc_page();
> 
> 	addr = vmap(page,  pgprot_decrypted(PAGE_KERNEL));
> 
> 	clflush_cache_range(page_address(page), PAGE_SIZE);
> 
> 	MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
> 
> And we can even avoid above clflush_cache_range() if I understand correctly.
> 
> Or  I missed something?

For completeness, cover free path too. Are you going to opencode page
accept too?

Private->Shared conversion is destructive. You have to split SEPT, flush
TLB. Backward conversion even more costly.

Rule of thumb is avoid conversion where possible. DMA API is there for
you.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 12:09                         ` Kirill A. Shutemov
@ 2022-05-09 14:14                           ` Dave Hansen
  2022-05-09 15:35                             ` Kirill A. Shutemov
  2022-05-09 23:54                           ` Kai Huang
  1 sibling, 1 reply; 56+ messages in thread
From: Dave Hansen @ 2022-05-09 14:14 UTC (permalink / raw)
  To: Kirill A. Shutemov, Kai Huang
  Cc: Kirill A. Shutemov, Sathyanarayanan Kuppuswamy, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Tony Luck, Andi Kleen, Wander Lairson Costa, Isaku Yamahata,
	marcelo.cerri, tim.gardner, khalid.elmously, philip.cox,
	linux-kernel

On 5/9/22 05:09, Kirill A. Shutemov wrote:
> Private->Shared conversion is destructive. You have to split SEPT, flush
> TLB. Backward conversion even more costly.
> 
> Rule of thumb is avoid conversion where possible. DMA API is there for
> you.

Kirill, I understand that the DMA API is a quick fix today.  But is it
_really_ the right long-term interface?

There will surely come a time when TDX I/O devices won't be using fixed
bounce buffers.  What will the quote generation code do then?  How will
we know to come back around and fix this up?

Does SEV or the s390 ultravisor need anything like this?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 14:14                           ` Dave Hansen
@ 2022-05-09 15:35                             ` Kirill A. Shutemov
  2022-05-09 15:43                               ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-09 15:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Kai Huang, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Mon, May 09, 2022 at 07:14:20AM -0700, Dave Hansen wrote:
> On 5/9/22 05:09, Kirill A. Shutemov wrote:
> > Private->Shared conversion is destructive. You have to split SEPT, flush
> > TLB. Backward conversion even more costly.
> > 
> > Rule of thumb is avoid conversion where possible. DMA API is there for
> > you.
> 
> Kirill, I understand that the DMA API is a quick fix today.  But is it
> _really_ the right long-term interface?

Yes, I think so.

> There will surely come a time when TDX I/O devices won't be using fixed
> bounce buffers.  What will the quote generation code do then?  How will
> we know to come back around and fix this up?

VirtIO will not go away with TDX I/O in picture. TDX I/O will be addition
to existing stuff, not replacement.

And we have hooks in place to accommodate this: force_dma_unencrypted()
will return false for devices capable of TDX I/O. While the rest of
devices, including VirtIO and attestation, keep using existing paths with
swiotlb.

> Does SEV or the s390 ultravisor need anything like this?

At quick glance sev-guest.c uses set_memory_decrypted()/encrypted() for
allocation and freeing shared memory. I consider it inferior to using DMA
API.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 15:35                             ` Kirill A. Shutemov
@ 2022-05-09 15:43                               ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-09 15:43 UTC (permalink / raw)
  To: Kirill A. Shutemov, Dave Hansen
  Cc: Kirill A. Shutemov, Kai Huang, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel



On 5/9/22 8:35 AM, Kirill A. Shutemov wrote:
> On Mon, May 09, 2022 at 07:14:20AM -0700, Dave Hansen wrote:
>> On 5/9/22 05:09, Kirill A. Shutemov wrote:
>>> Private->Shared conversion is destructive. You have to split SEPT, flush
>>> TLB. Backward conversion even more costly.
>>>
>>> Rule of thumb is avoid conversion where possible. DMA API is there for
>>> you.
>>
>> Kirill, I understand that the DMA API is a quick fix today.  But is it
>> _really_ the right long-term interface?
> 
> Yes, I think so.
> 
>> There will surely come a time when TDX I/O devices won't be using fixed
>> bounce buffers.  What will the quote generation code do then?  How will
>> we know to come back around and fix this up?
> 
> VirtIO will not go away with TDX I/O in picture. TDX I/O will be addition
> to existing stuff, not replacement.
> 
> And we have hooks in place to accommodate this: force_dma_unencrypted()
> will return false for devices capable of TDX I/O. While the rest of
> devices, including VirtIO and attestation, keep using existing paths with
> swiotlb.
> 
>> Does SEV or the s390 ultravisor need anything like this?
> 
> At quick glance sev-guest.c uses set_memory_decrypted()/encrypted() for
> allocation and freeing shared memory. I consider it inferior to using DMA
> API.

Following is the link for the SEV attestation driver. It does seem to
use alloc_pages() and set_memory_*() calls.

https://lore.kernel.org/lkml/20220307215344.2799259-1-brijesh.singh@amd.com/

> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 12:09                         ` Kirill A. Shutemov
  2022-05-09 14:14                           ` Dave Hansen
@ 2022-05-09 23:54                           ` Kai Huang
  2022-05-10  0:17                             ` Sathyanarayanan Kuppuswamy
                                               ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-09 23:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > Kirill, what's your opinion?
> > > 
> > > I said before that I think DMA API is the right tool here.
> > > 
> > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > is conceptually the same use-case: communicate with the host.
> > 
> > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
> > assumption.
> 
> Whether attestation driver uses struct device is implementation detail.
> I don't see what is you point.

No real DMA is involved in attestation.

> 
> > So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
> > page (or couple of pages) as shared on-demand, like below:
> > 
> > 	page = alloc_page();
> > 
> > 	addr = vmap(page,  pgprot_decrypted(PAGE_KERNEL));
> > 
> > 	clflush_cache_range(page_address(page), PAGE_SIZE);
> > 
> > 	MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
> > 
> > And we can even avoid above clflush_cache_range() if I understand correctly.
> > 
> > Or  I missed something?
> 
> For completeness, cover free path too. Are you going to opencode page
> accept too?

Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
back to private.  I don't think there is any problem?

> 
> Private->Shared conversion is destructive. You have to split SEPT, flush
> TLB. Backward conversion even more costly.

I think I won't call it destructive.

And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
which is large enough to cover all requests for now, during driver
initialization.  This avoids IOCTL time conversion.  We should still have code
in the IOCTL to check the request buffer size and when it is larger than the
default, the old should be freed a larger one should be allocated.  But for now
this code path will never happen.

Btw above is based on assumption that we don't support concurrent IOCTLs.  This
version Sathya somehow changed to support concurrent IOCTLs but this was a
surprise as I thought we somehow agreed we don't need to support this.

Anyway, now I don't have strong opinion here.  To me alloc_pages() +
set_memory_decrypted() is also fine (seems AMD is using this anyway).   Will let
Dave to decide.

-- 
Thanks,
-Kai

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 23:54                           ` Kai Huang
@ 2022-05-10  0:17                             ` Sathyanarayanan Kuppuswamy
  2022-05-10  1:30                             ` Kirill A. Shutemov
  2022-05-10 10:42                             ` Kai Huang
  2 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-10  0:17 UTC (permalink / raw)
  To: Kai Huang, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel



On 5/9/22 4:54 PM, Kai Huang wrote:
> And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
> which is large enough to cover all requests for now, during driver
> initialization.  This avoids IOCTL time conversion.  We should still have code
> in the IOCTL to check the request buffer size and when it is larger than the
> default, the old should be freed a larger one should be allocated.  But for now
> this code path will never happen.
> 
> Btw above is based on assumption that we don't support concurrent IOCTLs.  This
> version Sathya somehow changed to support concurrent IOCTLs but this was a
> surprise as I thought we somehow agreed we don't need to support this.

Yes. Initially, I did not want to add this support to keep the code
simple. But recent changes (to handle cases like, cleanup the VMM
owned page after user prematurely ends the current request, and to
support VMAP model) already made the code complicated. With current
framework changes, since it is easy to extend the code to support
concurrent GetQuote requests, I have just enabled this support.


> 
> Anyway, now I don't have strong opinion here.  To me alloc_pages() +
> set_memory_decrypted() is also fine (seems AMD is using this anyway).   Will let
> Dave to decide.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 23:54                           ` Kai Huang
  2022-05-10  0:17                             ` Sathyanarayanan Kuppuswamy
@ 2022-05-10  1:30                             ` Kirill A. Shutemov
  2022-05-10  1:40                               ` Kai Huang
  2022-05-10 10:42                             ` Kai Huang
  2 siblings, 1 reply; 56+ messages in thread
From: Kirill A. Shutemov @ 2022-05-10  1:30 UTC (permalink / raw)
  To: Kai Huang
  Cc: Kirill A. Shutemov, Dave Hansen, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, May 10, 2022 at 11:54:12AM +1200, Kai Huang wrote:
> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > Kirill, what's your opinion?
> > > > 
> > > > I said before that I think DMA API is the right tool here.
> > > > 
> > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > is conceptually the same use-case: communicate with the host.
> > > 
> > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
> > > assumption.
> > 
> > Whether attestation driver uses struct device is implementation detail.
> > I don't see what is you point.
> 
> No real DMA is involved in attestation.

As with VirtIO. So what?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-10  1:30                             ` Kirill A. Shutemov
@ 2022-05-10  1:40                               ` Kai Huang
  0 siblings, 0 replies; 56+ messages in thread
From: Kai Huang @ 2022-05-10  1:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, 2022-05-10 at 04:30 +0300, Kirill A. Shutemov wrote:
> On Tue, May 10, 2022 at 11:54:12AM +1200, Kai Huang wrote:
> > On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > > Kirill, what's your opinion?
> > > > > 
> > > > > I said before that I think DMA API is the right tool here.
> > > > > 
> > > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > > is conceptually the same use-case: communicate with the host.
> > > > 
> > > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > > real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
> > > > assumption.
> > > 
> > > Whether attestation driver uses struct device is implementation detail.
> > > I don't see what is you point.
> > 
> > No real DMA is involved in attestation.
> 
> As with VirtIO. So what?
> 

No real DMA can happen to virtio buffer.  Consider DPDK which uses virtio +
vhost-user.  But I don't want to argue anymore about this topic. :)

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-09 23:54                           ` Kai Huang
  2022-05-10  0:17                             ` Sathyanarayanan Kuppuswamy
  2022-05-10  1:30                             ` Kirill A. Shutemov
@ 2022-05-10 10:42                             ` Kai Huang
  2022-05-16 17:39                               ` Sathyanarayanan Kuppuswamy
  2 siblings, 1 reply; 56+ messages in thread
From: Kai Huang @ 2022-05-10 10:42 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Sathyanarayanan Kuppuswamy,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H . Peter Anvin, Tony Luck, Andi Kleen, Wander Lairson Costa,
	Isaku Yamahata, marcelo.cerri, tim.gardner, khalid.elmously,
	philip.cox, linux-kernel

On Tue, 2022-05-10 at 11:54 +1200, Kai Huang wrote:
> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > Kirill, what's your opinion?
> > > > 
> > > > I said before that I think DMA API is the right tool here.
> > > > 
> > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > is conceptually the same use-case: communicate with the host.
> > > 
> > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
> > > assumption.
> > 
> > Whether attestation driver uses struct device is implementation detail.
> > I don't see what is you point.
> 
> No real DMA is involved in attestation.
> 
> > 
> > > So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
> > > page (or couple of pages) as shared on-demand, like below:
> > > 
> > > 	page = alloc_page();
> > > 
> > > 	addr = vmap(page,  pgprot_decrypted(PAGE_KERNEL));
> > > 
> > > 	clflush_cache_range(page_address(page), PAGE_SIZE);
> > > 
> > > 	MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
> > > 
> > > And we can even avoid above clflush_cache_range() if I understand correctly.
> > > 
> > > Or  I missed something?
> > 
> > For completeness, cover free path too. Are you going to opencode page
> > accept too?
> 
> Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
> back to private.  I don't think there is any problem?
> 
> > 
> > Private->Shared conversion is destructive. You have to split SEPT, flush
> > TLB. Backward conversion even more costly.
> 
> I think I won't call it destructive.
> 
> And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
> which is large enough to cover all requests for now, during driver
> initialization.  This avoids IOCTL time conversion.  We should still have code
> in the IOCTL to check the request buffer size and when it is larger than the
> default, the old should be freed a larger one should be allocated.  But for now
> this code path will never happen.
> 
> Btw above is based on assumption that we don't support concurrent IOCTLs.  This
> version Sathya somehow changed to support concurrent IOCTLs but this was a
> surprise as I thought we somehow agreed we don't need to support this.

Hi Dave,

Sorry I forgot to mention that GHCI 1.5 defines a generic TDVMCALL<Service> for
a TD to communicate with VMM or another TD or some service in the host.  This
TDVMCALL can support many sub-commands.  For now only sub-commands for TD
migration is defined, but we can have more.

For this, we cannot assume the size of the command buffer, and I don't see why
we don't want to support concurrent TDVMCALLs.  So looks from long term, we will
very likely need IOCTL time buffer private-shared conversion.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support
  2022-05-10 10:42                             ` Kai Huang
@ 2022-05-16 17:39                               ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 56+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-05-16 17:39 UTC (permalink / raw)
  To: Kai Huang, Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Tony Luck,
	Andi Kleen, Wander Lairson Costa, Isaku Yamahata, marcelo.cerri,
	tim.gardner, khalid.elmously, philip.cox, linux-kernel

Hi Dave,

On 5/10/22 3:42 AM, Kai Huang wrote:
> On Tue, 2022-05-10 at 11:54 +1200, Kai Huang wrote:
>> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
>>> On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
>>>> On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
>>>>>> Kirill, what's your opinion?
>>>>>
>>>>> I said before that I think DMA API is the right tool here.
>>>>>
>>>>> Speculation about future of DMA in TDX is irrelevant here. If semantics
>>>>> change we will need to re-evaluate all users. VirtIO uses DMA API and it
>>>>> is conceptually the same use-case: communicate with the host.
>>>>
>>>> Virtio is designed for device driver to use, so it's fine to use DMA API. And
>>>> real DMA can happen to the virtio DMA buffers.  Attestation doesn't have such
>>>> assumption.
>>>
>>> Whether attestation driver uses struct device is implementation detail.
>>> I don't see what is you point.
>>
>> No real DMA is involved in attestation.
>>
>>>
>>>> So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
>>>> page (or couple of pages) as shared on-demand, like below:
>>>>
>>>> 	page = alloc_page();
>>>>
>>>> 	addr = vmap(page,  pgprot_decrypted(PAGE_KERNEL));
>>>>
>>>> 	clflush_cache_range(page_address(page), PAGE_SIZE);
>>>>
>>>> 	MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
>>>>
>>>> And we can even avoid above clflush_cache_range() if I understand correctly.
>>>>
>>>> Or  I missed something?
>>>
>>> For completeness, cover free path too. Are you going to opencode page
>>> accept too?
>>
>> Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
>> back to private.  I don't think there is any problem?
>>
>>>
>>> Private->Shared conversion is destructive. You have to split SEPT, flush
>>> TLB. Backward conversion even more costly.
>>
>> I think I won't call it destructive.
>>
>> And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
>> which is large enough to cover all requests for now, during driver
>> initialization.  This avoids IOCTL time conversion.  We should still have code
>> in the IOCTL to check the request buffer size and when it is larger than the
>> default, the old should be freed a larger one should be allocated.  But for now
>> this code path will never happen.
>>
>> Btw above is based on assumption that we don't support concurrent IOCTLs.  This
>> version Sathya somehow changed to support concurrent IOCTLs but this was a
>> surprise as I thought we somehow agreed we don't need to support this.
> 
> Hi Dave,
> 
> Sorry I forgot to mention that GHCI 1.5 defines a generic TDVMCALL<Service> for
> a TD to communicate with VMM or another TD or some service in the host.  This
> TDVMCALL can support many sub-commands.  For now only sub-commands for TD
> migration is defined, but we can have more.
> 
> For this, we cannot assume the size of the command buffer, and I don't see why
> we don't want to support concurrent TDVMCALLs.  So looks from long term, we will
> very likely need IOCTL time buffer private-shared conversion.
> 
> 


Let me summarize the discussion so far.

Problem: Allocate shared buffer without breaking the direct map.

Solution 1: Use alloc_pages*()/vmap()/set_memory_*crypted() APIs

Pros/Cons:

1. Uses virtual mapped address for shared/private conversion and
    hence does not touch the direct mapping.

2. Current version of set_memory_*crypted() APIs  modifies the
    aliased mappings, which also includes the direct mapping. So if we
    want to use set_memory_*() APIs, we need a new variant that does not
    touch the direct mapping. An alternative solution is to open code the
    page attribute conversion, cache flushing and MapGpa/Page acceptance
    logic in the attestation driver itself. But, IMO, this is not
    preferred because it is not favorable to sprinkle the mapping
    conversion code in multiple places in the kernel. It is better to use
    a single API if possible.

3. This solution can possibly break the SEPT entries on private-shared
    conversion. The backward conversion is also costly. IMO, since the
    attestation requests are not very frequent, we don't need to be
    overly concerned about the cost involved in the conversion.

Solution 2: Use DMA alloc APIs.

Pros/Cons:

1. Simpler to use. It taps into the SWIOTLB buffers and does not
    affect the direct map. Since we will be using already converted
    memory, allocation/freeing will be cheaper compared to solution 1.

2. There is a concern that it is not a long term solution. Since
    with advent of TDX IO, not all DMA allocations need to use
    SWIOTLB model. But as per Kirill's comments, this is not a concern
    and force_dma_unencrypted() hook can be used to differentiate which
    devices need to use TDX IO vs SWIOTLB model.

3. Using DMA APIs requires a valid bus device as argument and hence
    requires this driver converted into a platform device driver. But,
    since this driver does not do real DMA, making above changes just
    to use the DMA API is not recommended.

Since both solutions fix the problem (and there are pros/cons), and both
Kai/Kirill's comments conclusion is, there is no hard preference and
to let you decide on it.

Since you have already made a comment about "irrespective of which
model is chosen, you need the commit log talk about the solution and
how it not touches the direct map", I have posted the v6 version
adapting Solution 1.

Please let me know if you agree with this direction or have comments
about the solution.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2022-05-16 17:39 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-01 18:34 [PATCH v5 0/3] Add TDX Guest Attestation support Kuppuswamy Sathyanarayanan
2022-05-01 18:34 ` [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver Kuppuswamy Sathyanarayanan
2022-05-02  2:31   ` Kai Huang
2022-05-02 15:52     ` Sathyanarayanan Kuppuswamy
2022-05-02 22:30       ` Kai Huang
2022-05-02 23:17         ` Sathyanarayanan Kuppuswamy
2022-05-02 23:37           ` Kai Huang
2022-05-03 14:38           ` Wander Costa
2022-05-03 15:09             ` Sathyanarayanan Kuppuswamy
2022-05-03 22:08               ` Kai Huang
2022-05-02 12:18   ` Wander Lairson Costa
2022-05-02 16:06     ` Sathyanarayanan Kuppuswamy
2022-05-01 18:34 ` [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support Kuppuswamy Sathyanarayanan
2022-05-02 12:44   ` Wander Lairson Costa
2022-05-01 18:35 ` [PATCH v5 3/3] x86/tdx: Add Quote generation support Kuppuswamy Sathyanarayanan
2022-05-02  2:40   ` Kai Huang
2022-05-03  1:27     ` Kirill A. Shutemov
2022-05-03  2:18       ` Kai Huang
2022-05-03  2:39         ` Sathyanarayanan Kuppuswamy
2022-05-03 22:13           ` Kai Huang
2022-05-03  2:45         ` Kirill A. Shutemov
2022-05-03  3:36           ` Kai Huang
2022-05-03 22:24       ` Dave Hansen
2022-05-03 22:28         ` Sathyanarayanan Kuppuswamy
2022-05-03 22:30           ` Sathyanarayanan Kuppuswamy
2022-05-04 22:49         ` Sathyanarayanan Kuppuswamy
2022-05-04 23:28           ` Kai Huang
2022-05-05 20:53             ` Sathyanarayanan Kuppuswamy
2022-05-05 22:15               ` Kai Huang
2022-05-05 22:38                 ` Sathyanarayanan Kuppuswamy
2022-05-05 23:06                 ` Dave Hansen
2022-05-06  0:11                   ` Kai Huang
2022-05-06  1:55                     ` Sathyanarayanan Kuppuswamy
2022-05-07  0:42                     ` Kirill A. Shutemov
2022-05-09  3:37                       ` Kai Huang
2022-05-09 12:09                         ` Kirill A. Shutemov
2022-05-09 14:14                           ` Dave Hansen
2022-05-09 15:35                             ` Kirill A. Shutemov
2022-05-09 15:43                               ` Sathyanarayanan Kuppuswamy
2022-05-09 23:54                           ` Kai Huang
2022-05-10  0:17                             ` Sathyanarayanan Kuppuswamy
2022-05-10  1:30                             ` Kirill A. Shutemov
2022-05-10  1:40                               ` Kai Huang
2022-05-10 10:42                             ` Kai Huang
2022-05-16 17:39                               ` Sathyanarayanan Kuppuswamy
2022-05-06 11:00                   ` Kai Huang
2022-05-06 15:47                     ` Dave Hansen
2022-05-07  1:00                     ` Kirill A. Shutemov
2022-05-05 10:50           ` Kai Huang
2022-05-05 19:03             ` Sathyanarayanan Kuppuswamy
2022-05-05 22:25               ` Kai Huang
2022-05-02  5:01   ` Kai Huang
2022-05-02 16:02     ` Sathyanarayanan Kuppuswamy
2022-05-02 23:02       ` Kai Huang
2022-05-02 13:16   ` Wander Lairson Costa
2022-05-02 16:05     ` Sathyanarayanan Kuppuswamy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).