linux-sgx.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior
@ 2022-09-20  6:39 Zhiquan Li
  2022-09-20  6:39 ` [PATCH v9 1/3] x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner Zhiquan Li
                   ` (4 more replies)
  0 siblings, 5 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-09-20  6:39 UTC (permalink / raw)
  To: linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, zhiquan1.li

V8: https://lore.kernel.org/linux-sgx/20220913145330.2998212-1-zhiquan1.li@intel.com/T/#t

Changes since V8:
- Remove excess Acked-by from patch 02 and 03.

V7: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#t

Changes since V7:
- Enrich the motivation for renaming in commit message of patch 01 with
  the explanation from Kai.
- Add Acked-by from Jarkko.
  Link: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#mc1c93e7d9643588b27cefa9540f988a070469b5b
- Add Acked-by from Kai Huang at patch 01.

V6: https://lore.kernel.org/linux-sgx/20220826160503.1576966-1-zhiquan1.li@intel.com/T/#t

Changes since V6:
- Revise the commit message of patch 01 suggested by Jarkko.
- Fix build warning due to type changes.

V5: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#t

Changes since V5:
- Rename the 'owner' field as 'encl_owner' and update the references
  as a separate patch.
- To prevent casting the 'encl_owner' field, introduce a union with
  another field - "vepc_vaddr", suggested by Dave Hansen.
- Clean up the commit message of patch 02 suggested by Dave Hansen.
- Remove patch 03 unless we have better reason to keep it.
- Add Reviewed-by from Jarkko.
  Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m379d00fc7f1d43726a42b3884637532061a8c0d1

V4: https://lore.kernel.org/linux-sgx/20220608032654.1764936-1-zhiquan1.li@intel.com/T/#t

Changes since V4:
- Switch the order of the two variables at patch 02 so all of variables
  are in reverse Christmas style.
- Do not initialize 'ret' because it will be overridden by the return
  value of force_sig_mceerr() unconditionally.
- Add Co-developed-by and Signed-off-by from Cathy Zhang at patch 01.
- Add Acked-by from Kai Huang at patch 01.

V3: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#t

Changes since V3:
- Take the definition of EPC page flag SGX_EPC_PAGE_KVM_GUEST from
  Cathy Zhang's third patch of SGX rebootless recovery patch set but
  discard irrelevant portion, since it might need some time to re-forge
  and these are two different features.
  Link: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#m9782d23496cacecb7da07a67daa79f4b322ae170

V2: https://lore.kernel.org/linux-sgx/694234d7-6a0d-e85f-f2f9-e52b4a61e1ec@intel.com/T/#t

Changes since V2:
- Repurpose the owner field as the virtual address of virtual EPC page
- Remove struct sgx_vepc_page and relevant code.
- Remove patch 01 as the changes are not necessary in new design.
- Rework patch 02 suggested by Jarkko.
- Adapt patch 03 and 04 since struct sgx_vepc_page was discarded.
- Replace EPC page flag SGX_EPC_PAGE_IS_VEPC with
  SGX_EPC_PAGE_KVM_GUEST as they are duplicated.
  Link: https://lore.kernel.org/linux-sgx/eb95b32ecf3d44a695610cf7f2816785@intel.com/T/#u

V1: https://lore.kernel.org/linux-sgx/443cb425-009c-2784-56f4-5e707122de76@intel.com/T/#t

Changes since V1:
- Updated cover letter and commit messages, added valuable
  information from Jarkko, Tony and Kai's comments.
- Added documentations for struct struct sgx_vepc and
  struct sgx_vepc_page.

Hi everyone,

This series contains a few patches to fine grained SGX MCA behavior.
Today, if a guest accesses an SGX EPC page with memory failure,
the kernel behavior will kill the entire guest.  This blast radius is
too large.  It would be idea to kill only the SGX application inside
the guest.

To fix this, send a SIGBUS to host userspace (like QEMU) which can
follow up by injecting a #MC to the guest.
However, when a page triggers a machine check, it only reports the
PFN.  But in order to inject #MC into hypervisor, the virtual address
is required.  The 'encl_owner' field is useless in virtualization
case, then repurpose it as 'vepc_vaddr' - the virtual address of the
virtual EPC page for such case so that arch_memory_failure() can easily
retrieve it.

Suppose an enclave is shared by multiple processes, when an enclave
page triggers a machine check, the enclave will be disabled so that
it couldn't be entered again.  Killing other processes with the same
enclave mapped would perhaps be overkill, but they are going to find
that the enclave is "dead" next time they try to use it.  Thanks for
Jarkko’s head up and Tony’s clarification on this point.
Unlike host enclaves, virtual EPC instance cannot be shared by multiple
VMs. It is because how enclaves are created is totally up to the guest.
Sharing virtual EPC instance will be very likely to unexpectedly break
enclaves in all VMs.

SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
being shared by multiple VMs via fork(). However KVM doesn't support
running a VM across multiple mm structures, and the de facto userspace
hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
this should not happen.

This series is based on tip/x86/sgx.

Tests:
1. MCE injection test for SGX in VM.
   As we expected, the application was killed and VM was alive.
2. Kernel selftest/sgx: PASS
3. Internal SGX stress test: PASS
4. kmemleak test: No memory leakage detected.

Much appreciate your feedback.

Best Regards,
Zhiquan

Zhiquan Li (3):
  x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner
  x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  x86/sgx: Fine grained SGX MCA behavior for virtualization

 arch/x86/kernel/cpu/sgx/main.c | 48 +++++++++++++++++++++++++---------
 arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++++-
 arch/x86/kernel/cpu/sgx/virt.c |  4 ++-
 3 files changed, 46 insertions(+), 14 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH v9 1/3] x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner
  2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
@ 2022-09-20  6:39 ` Zhiquan Li
  2022-09-20  6:39 ` [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case Zhiquan Li
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-09-20  6:39 UTC (permalink / raw)
  To: linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, zhiquan1.li

In order to send SIGBUS to userspace hypervisor to allow it to
inject #MC to guest, use virtual EPC page's owner to be the userspace
virtual address of the EPC page.  To avoid casting, use a union to
separate the use of owner for SGX driver EPC page and virtual EPC page
in the next step.

To pave the way, rename owner of SGX driver EPC page to 'encl_owner' to
be more specific and update all of references.  There is no functional
change.

Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Acked-by: Kai Huang <kai.huang@intel.com>

---
No changes since V8.

Changes since V7:
- Enrich the motivation for renaming in commit message with the
  explanation from Kai.
  Link: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#me02a2ce0f3cc0122e62dac496d89321d1c006807
- Add Acked-by from Jarkko.
- Add Acked-by from Kai Huang.

Changes since V6:
- Revise the commit message suggested by Jarkko.
  Link: https://lore.kernel.org/linux-sgx/20220826160503.1576966-1-zhiquan1.li@intel.com/T/#mb201506ed06932438c82d48915cd4ceae9745bc2
---
 arch/x86/kernel/cpu/sgx/main.c | 20 ++++++++++----------
 arch/x86/kernel/cpu/sgx/sgx.h  |  2 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 515e2a5f25bb..1315c69a733e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -102,7 +102,7 @@ static void __sgx_sanitize_pages(struct list_head *dirty_page_list)
 
 static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_owner;
 	struct sgx_encl *encl = page->encl;
 	struct sgx_encl_mm *encl_mm;
 	bool ret = true;
@@ -134,7 +134,7 @@ static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
 
 static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
 {
-	struct sgx_encl_page *page = epc_page->owner;
+	struct sgx_encl_page *page = epc_page->encl_owner;
 	unsigned long addr = page->desc & PAGE_MASK;
 	struct sgx_encl *encl = page->encl;
 	int ret;
@@ -191,7 +191,7 @@ void sgx_ipi_cb(void *info)
 static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 			 struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_owner;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_va_page *va_page;
 	unsigned int va_offset;
@@ -244,7 +244,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
 static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
 				struct sgx_backing *backing)
 {
-	struct sgx_encl_page *encl_page = epc_page->owner;
+	struct sgx_encl_page *encl_page = epc_page->encl_owner;
 	struct sgx_encl *encl = encl_page->encl;
 	struct sgx_backing secs_backing;
 	int ret;
@@ -306,7 +306,7 @@ static void sgx_reclaim_pages(void)
 		epc_page = list_first_entry(&sgx_active_page_list,
 					    struct sgx_epc_page, list);
 		list_del_init(&epc_page->list);
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_owner;
 
 		if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
 			chunk[cnt++] = epc_page;
@@ -320,7 +320,7 @@ static void sgx_reclaim_pages(void)
 
 	for (i = 0; i < cnt; i++) {
 		epc_page = chunk[i];
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_owner;
 
 		if (!sgx_reclaimer_age(epc_page))
 			goto skip;
@@ -359,7 +359,7 @@ static void sgx_reclaim_pages(void)
 		if (!epc_page)
 			continue;
 
-		encl_page = epc_page->owner;
+		encl_page = epc_page->encl_owner;
 		sgx_reclaimer_write(epc_page, &backing[i]);
 
 		kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -560,7 +560,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
 	for ( ; ; ) {
 		page = __sgx_alloc_epc_page();
 		if (!IS_ERR(page)) {
-			page->owner = owner;
+			page->encl_owner = owner;
 			break;
 		}
 
@@ -603,7 +603,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
 
 	spin_lock(&node->lock);
 
-	page->owner = NULL;
+	page->encl_owner = NULL;
 	if (page->poison)
 		list_add(&page->list, &node->sgx_poison_page_list);
 	else
@@ -638,7 +638,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
 	for (i = 0; i < nr_pages; i++) {
 		section->pages[i].section = index;
 		section->pages[i].flags = 0;
-		section->pages[i].owner = NULL;
+		section->pages[i].encl_owner = NULL;
 		section->pages[i].poison = 0;
 		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
 	}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0f2020653fba..4d88abccd12e 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -33,7 +33,7 @@ struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *owner;
+	struct sgx_encl_page *encl_owner;
 	struct list_head list;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
  2022-09-20  6:39 ` [PATCH v9 1/3] x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner Zhiquan Li
@ 2022-09-20  6:39 ` Zhiquan Li
  2022-10-10 23:10   ` Dave Hansen
  2022-09-20  6:39 ` [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization Zhiquan Li
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-09-20  6:39 UTC (permalink / raw)
  To: linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, zhiquan1.li

When a page triggers a machine check, it only reports the PFN.  But in
order to inject #MC into hypervisor, the virtual address is required.
The 'encl_owner' field is useless in virtualization case, then
repurpose it as 'vepc_vaddr' - the virtual address of the virtual EPC
page for such case so that arch_memory_failure() can easily retrieve it.

Introduce a union to prevent adding a new dedicated structure to
track the virtual address of virtual EPC page.  And it can also prevent
playing the casting games while using it.

Add a new EPC page flag - SGX_EPC_PAGE_KVM_GUEST to interpret the
meaning of the field.

Co-developed-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

---
Changes since V8:
- Remove excess Acked-by.

Changes since V7:
- Add Acked-by from Jarkko.

No changes since V6.

Changes since V5:
- To prevent casting the 'encl_owner' field, introduce a union with
  another field - 'vepc_vaddr', sugguested by Dave Hansen.
- Add Reviewed-by from Jarkko.
  Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m379d00fc7f1d43726a42b3884637532061a8c0d1

Changes since V4:
- Add Co-developed-by and Signed-off-by from Cathy Zhang, as she had
  fully discussed the flag name with Jarkko.
  Link: https://lore.kernel.org/all/df92395ade424401ac3c6322de568720@intel.com/
- Add Acked-by from Kai Huang
  Link: https://lore.kernel.org/linux-sgx/0676cd4e-d94b-e904-81ae-ca1c05d37070@intel.com/T/#mccfb11df30698dbd060f2b6f06383cda7f154ef3

Changes since V3:
- Take the definition of EPC page flag SGX_EPC_PAGE_KVM_GUEST from
  Cathy Zhang's third patch of SGX rebootless recovery patch set but
  discard irrelevant portion, since it might need some time to
  re-forge and these are two different features.
  Link: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#m9782d23496cacecb7da07a67daa79f4b322ae170

Changes since V2:
- Remove struct sgx_vepc_page and relevant code.
- Rework the patch suggested by Jarkko.
- Remove new EPC page flag SGX_EPC_PAGE_IS_VEPC definition as it is
  duplicated to SGX_EPC_PAGE_KVM_GUEST.
  Link: https://lore.kernel.org/linux-sgx/eb95b32ecf3d44a695610cf7f2816785@intel.com/T/#u

Changes since V1:
- Add documentation suggested by Jarkko.
---
 arch/x86/kernel/cpu/sgx/main.c | 4 ++++
 arch/x86/kernel/cpu/sgx/sgx.h  | 8 +++++++-
 arch/x86/kernel/cpu/sgx/virt.c | 4 +++-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 1315c69a733e..b319bedcaf1e 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -549,6 +549,10 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
  * Finally, wake up ksgxd when the number of pages goes below the watermark
  * before returning back to the caller.
  *
+ * When an EPC page is assigned to KVM guest, repurpose the 'encl_owner' field
+ * as the virtual address of virtual EPC page, since it is useless in such
+ * scenario, so 'owner' is assigned to 'vepc_vaddr'.
+ *
  * Return:
  *   an EPC page,
  *   -errno on error
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 4d88abccd12e..d16a8baa28d4 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -28,12 +28,18 @@
 
 /* Pages on free list */
 #define SGX_EPC_PAGE_IS_FREE		BIT(1)
+/* Pages allocated for KVM guest */
+#define SGX_EPC_PAGE_KVM_GUEST		BIT(2)
 
 struct sgx_epc_page {
 	unsigned int section;
 	u16 flags;
 	u16 poison;
-	struct sgx_encl_page *encl_owner;
+	union {
+		struct sgx_encl_page *encl_owner;
+		/* Use when SGX_EPC_PAGE_KVM_GUEST set in ->flags: */
+		void __user *vepc_vaddr;
+	};
 	struct list_head list;
 };
 
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 6a77a14eee38..776ae5c1c032 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -46,10 +46,12 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
 	if (epc_page)
 		return 0;
 
-	epc_page = sgx_alloc_epc_page(vepc, false);
+	epc_page = sgx_alloc_epc_page((void *)addr, false);
 	if (IS_ERR(epc_page))
 		return PTR_ERR(epc_page);
 
+	epc_page->flags |= SGX_EPC_PAGE_KVM_GUEST;
+
 	ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
 	if (ret)
 		goto err_free;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
  2022-09-20  6:39 ` [PATCH v9 1/3] x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner Zhiquan Li
  2022-09-20  6:39 ` [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case Zhiquan Li
@ 2022-09-20  6:39 ` Zhiquan Li
  2022-10-10 23:20   ` Dave Hansen
  2022-10-11 14:04   ` Dave Hansen
  2022-09-29  8:05 ` [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
  2022-10-08  2:29 ` Zhiquan Li
  4 siblings, 2 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-09-20  6:39 UTC (permalink / raw)
  To: linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, zhiquan1.li

Today, if a guest accesses an SGX EPC page with memory failure,
the kernel behavior will kill the entire guest.  This blast
radius is too large.  It would be idea to kill only the SGX
application inside the guest.

To fix this, send a SIGBUS to host userspace (like QEMU) which can
follow up by injecting a #MC to the guest.

SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
being shared by multiple VMs via fork().  However KVM doesn't support
running a VM across multiple mm structures, and the de facto userspace
hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
this should not happen.

Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/linux-sgx/443cb425-009c-2784-56f4-5e707122de76@intel.com/T/#m1d1f4098f4fad78034e8706a60e4d79c119db407
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>

---
Changes since V8:
- Remove excess Acked-by.

Changes since V7:
- Add Acked-by from Jarkko.

Changes since V6:
- Fix build warning due to type changes.

Changes since V5:
- Use the 'vepc_vaddr' field instead of casting the 'owner' field.
- Clean up the commit message suggested by Dave.
  Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m2ff4778948cdc9ee65f09672f1d02f8dc467247b
- Add Reviewed-by from Jarkko.

Changes since V4:
- Switch the order of the two variables so all of variables are in
  reverse Christmas style.
- Do not initialize "ret" because it will be overridden by the return
  value of force_sig_mceerr() unconditionally.

Changes since V2:
- Retrieve virtual address from "owner" field of struct sgx_epc_page,
  instead of struct sgx_vepc_page.
- Replace EPC page flag SGX_EPC_PAGE_IS_VEPC with
  SGX_EPC_PAGE_KVM_GUEST as they are duplicated.

Changes since V1:
- Add Acked-by from Kai Huang.
- Add Kai's excellent explanation regarding to why we no need to
  consider that one virtual EPC be shared by two guests.
---
 arch/x86/kernel/cpu/sgx/main.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b319bedcaf1e..160c8dbee0ab 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -679,6 +679,8 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	struct sgx_epc_page *page = sgx_paddr_to_page(pfn << PAGE_SHIFT);
 	struct sgx_epc_section *section;
 	struct sgx_numa_node *node;
+	void __user *vaddr;
+	int ret;
 
 	/*
 	 * mm/memory-failure.c calls this routine for all errors
@@ -695,8 +697,26 @@ int arch_memory_failure(unsigned long pfn, int flags)
 	 * error. The signal may help the task understand why the
 	 * enclave is broken.
 	 */
-	if (flags & MF_ACTION_REQUIRED)
-		force_sig(SIGBUS);
+	if (flags & MF_ACTION_REQUIRED) {
+		/*
+		 * Provide extra info to the task so that it can make further
+		 * decision but not simply kill it. This is quite useful for
+		 * virtualization case.
+		 */
+		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
+			/*
+			 * The 'encl_owner' field is repurposed, when allocating EPC
+			 * page it was assigned to the virtual address of virtual EPC
+			 * page.
+			 */
+			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
+			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
+			if (ret < 0)
+				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
+					current->comm, current->pid, ret);
+		} else
+			force_sig(SIGBUS);
+	}
 
 	section = &sgx_epc_sections[page->section];
 	node = section->node;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior
  2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
                   ` (2 preceding siblings ...)
  2022-09-20  6:39 ` [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization Zhiquan Li
@ 2022-09-29  8:05 ` Zhiquan Li
  2022-10-08  2:29 ` Zhiquan Li
  4 siblings, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-09-29  8:05 UTC (permalink / raw)
  To: linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, tglx, bp


On 2022/9/20 14:39, Zhiquan Li wrote:
> V8: https://lore.kernel.org/linux-sgx/20220913145330.2998212-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V8:
> - Remove excess Acked-by from patch 02 and 03.
> 
> V7: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#t
> 
> Changes since V7:
> - Enrich the motivation for renaming in commit message of patch 01 with
>   the explanation from Kai.
> - Add Acked-by from Jarkko.
>   Link: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#mc1c93e7d9643588b27cefa9540f988a070469b5b
> - Add Acked-by from Kai Huang at patch 01.
> 
> V6: https://lore.kernel.org/linux-sgx/20220826160503.1576966-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V6:
> - Revise the commit message of patch 01 suggested by Jarkko.
> - Fix build warning due to type changes.
> 
> V5: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#t
> 
> Changes since V5:
> - Rename the 'owner' field as 'encl_owner' and update the references
>   as a separate patch.
> - To prevent casting the 'encl_owner' field, introduce a union with
>   another field - "vepc_vaddr", suggested by Dave Hansen.
> - Clean up the commit message of patch 02 suggested by Dave Hansen.
> - Remove patch 03 unless we have better reason to keep it.
> - Add Reviewed-by from Jarkko.
>   Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m379d00fc7f1d43726a42b3884637532061a8c0d1
> 
> V4: https://lore.kernel.org/linux-sgx/20220608032654.1764936-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V4:
> - Switch the order of the two variables at patch 02 so all of variables
>   are in reverse Christmas style.
> - Do not initialize 'ret' because it will be overridden by the return
>   value of force_sig_mceerr() unconditionally.
> - Add Co-developed-by and Signed-off-by from Cathy Zhang at patch 01.
> - Add Acked-by from Kai Huang at patch 01.
> 
> V3: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#t
> 
> Changes since V3:
> - Take the definition of EPC page flag SGX_EPC_PAGE_KVM_GUEST from
>   Cathy Zhang's third patch of SGX rebootless recovery patch set but
>   discard irrelevant portion, since it might need some time to re-forge
>   and these are two different features.
>   Link: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#m9782d23496cacecb7da07a67daa79f4b322ae170
> 
> V2: https://lore.kernel.org/linux-sgx/694234d7-6a0d-e85f-f2f9-e52b4a61e1ec@intel.com/T/#t
> 
> Changes since V2:
> - Repurpose the owner field as the virtual address of virtual EPC page
> - Remove struct sgx_vepc_page and relevant code.
> - Remove patch 01 as the changes are not necessary in new design.
> - Rework patch 02 suggested by Jarkko.
> - Adapt patch 03 and 04 since struct sgx_vepc_page was discarded.
> - Replace EPC page flag SGX_EPC_PAGE_IS_VEPC with
>   SGX_EPC_PAGE_KVM_GUEST as they are duplicated.
>   Link: https://lore.kernel.org/linux-sgx/eb95b32ecf3d44a695610cf7f2816785@intel.com/T/#u
> 
> V1: https://lore.kernel.org/linux-sgx/443cb425-009c-2784-56f4-5e707122de76@intel.com/T/#t
> 
> Changes since V1:
> - Updated cover letter and commit messages, added valuable
>   information from Jarkko, Tony and Kai's comments.
> - Added documentations for struct struct sgx_vepc and
>   struct sgx_vepc_page.
> 
> Hi everyone,
> 
> This series contains a few patches to fine grained SGX MCA behavior.
> Today, if a guest accesses an SGX EPC page with memory failure,
> the kernel behavior will kill the entire guest.  This blast radius is
> too large.  It would be idea to kill only the SGX application inside
> the guest.
> 
> To fix this, send a SIGBUS to host userspace (like QEMU) which can
> follow up by injecting a #MC to the guest.
> However, when a page triggers a machine check, it only reports the
> PFN.  But in order to inject #MC into hypervisor, the virtual address
> is required.  The 'encl_owner' field is useless in virtualization
> case, then repurpose it as 'vepc_vaddr' - the virtual address of the
> virtual EPC page for such case so that arch_memory_failure() can easily
> retrieve it.
> 
> Suppose an enclave is shared by multiple processes, when an enclave
> page triggers a machine check, the enclave will be disabled so that
> it couldn't be entered again.  Killing other processes with the same
> enclave mapped would perhaps be overkill, but they are going to find
> that the enclave is "dead" next time they try to use it.  Thanks for
> Jarkko’s head up and Tony’s clarification on this point.
> Unlike host enclaves, virtual EPC instance cannot be shared by multiple
> VMs. It is because how enclaves are created is totally up to the guest.
> Sharing virtual EPC instance will be very likely to unexpectedly break
> enclaves in all VMs.
> 
> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork(). However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.
> 
> This series is based on tip/x86/sgx.
> 
> Tests:
> 1. MCE injection test for SGX in VM.
>    As we expected, the application was killed and VM was alive.
> 2. Kernel selftest/sgx: PASS
> 3. Internal SGX stress test: PASS
> 4. kmemleak test: No memory leakage detected.
> 
> Much appreciate your feedback.
> 
> Best Regards,
> Zhiquan
> 
> Zhiquan Li (3):
>   x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner
>   x86/sgx: Introduce union with vepc_vaddr field for virtualization case
>   x86/sgx: Fine grained SGX MCA behavior for virtualization
> 
>  arch/x86/kernel/cpu/sgx/main.c | 48 +++++++++++++++++++++++++---------
>  arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++++-
>  arch/x86/kernel/cpu/sgx/virt.c |  4 ++-
>  3 files changed, 46 insertions(+), 14 deletions(-)
> 

Hello Jarkko,

Could this patch set be merged into tip/x86/sgx branch, since there is
no concern for a long time?

Thanks & Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior
  2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
                   ` (3 preceding siblings ...)
  2022-09-29  8:05 ` [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
@ 2022-10-08  2:29 ` Zhiquan Li
  4 siblings, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-10-08  2:29 UTC (permalink / raw)
  To: linux-sgx, dave.hansen, bp
  Cc: seanjc, fan.du, cathy.zhang, tony.luck, jarkko, kai.huang, tglx, bp


On 2022/9/20 14:39, Zhiquan Li wrote:
> V8: https://lore.kernel.org/linux-sgx/20220913145330.2998212-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V8:
> - Remove excess Acked-by from patch 02 and 03.
> 
> V7: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#t
> 
> Changes since V7:
> - Enrich the motivation for renaming in commit message of patch 01 with
>   the explanation from Kai.
> - Add Acked-by from Jarkko.
>   Link: https://lore.kernel.org/linux-sgx/YxEyRT2SbfBdYNfm@kernel.org/T/#mc1c93e7d9643588b27cefa9540f988a070469b5b
> - Add Acked-by from Kai Huang at patch 01.
> 
> V6: https://lore.kernel.org/linux-sgx/20220826160503.1576966-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V6:
> - Revise the commit message of patch 01 suggested by Jarkko.
> - Fix build warning due to type changes.
> 
> V5: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#t
> 
> Changes since V5:
> - Rename the 'owner' field as 'encl_owner' and update the references
>   as a separate patch.
> - To prevent casting the 'encl_owner' field, introduce a union with
>   another field - "vepc_vaddr", suggested by Dave Hansen.
> - Clean up the commit message of patch 02 suggested by Dave Hansen.
> - Remove patch 03 unless we have better reason to keep it.
> - Add Reviewed-by from Jarkko.
>   Link: https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m379d00fc7f1d43726a42b3884637532061a8c0d1
> 
> V4: https://lore.kernel.org/linux-sgx/20220608032654.1764936-1-zhiquan1.li@intel.com/T/#t
> 
> Changes since V4:
> - Switch the order of the two variables at patch 02 so all of variables
>   are in reverse Christmas style.
> - Do not initialize 'ret' because it will be overridden by the return
>   value of force_sig_mceerr() unconditionally.
> - Add Co-developed-by and Signed-off-by from Cathy Zhang at patch 01.
> - Add Acked-by from Kai Huang at patch 01.
> 
> V3: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#t
> 
> Changes since V3:
> - Take the definition of EPC page flag SGX_EPC_PAGE_KVM_GUEST from
>   Cathy Zhang's third patch of SGX rebootless recovery patch set but
>   discard irrelevant portion, since it might need some time to re-forge
>   and these are two different features.
>   Link: https://lore.kernel.org/linux-sgx/41704e5d4c03b49fcda12e695595211d950cfb08.camel@kernel.org/T/#m9782d23496cacecb7da07a67daa79f4b322ae170
> 
> V2: https://lore.kernel.org/linux-sgx/694234d7-6a0d-e85f-f2f9-e52b4a61e1ec@intel.com/T/#t
> 
> Changes since V2:
> - Repurpose the owner field as the virtual address of virtual EPC page
> - Remove struct sgx_vepc_page and relevant code.
> - Remove patch 01 as the changes are not necessary in new design.
> - Rework patch 02 suggested by Jarkko.
> - Adapt patch 03 and 04 since struct sgx_vepc_page was discarded.
> - Replace EPC page flag SGX_EPC_PAGE_IS_VEPC with
>   SGX_EPC_PAGE_KVM_GUEST as they are duplicated.
>   Link: https://lore.kernel.org/linux-sgx/eb95b32ecf3d44a695610cf7f2816785@intel.com/T/#u
> 
> V1: https://lore.kernel.org/linux-sgx/443cb425-009c-2784-56f4-5e707122de76@intel.com/T/#t
> 
> Changes since V1:
> - Updated cover letter and commit messages, added valuable
>   information from Jarkko, Tony and Kai's comments.
> - Added documentations for struct struct sgx_vepc and
>   struct sgx_vepc_page.
> 
> Hi everyone,
> 
> This series contains a few patches to fine grained SGX MCA behavior.
> Today, if a guest accesses an SGX EPC page with memory failure,
> the kernel behavior will kill the entire guest.  This blast radius is
> too large.  It would be idea to kill only the SGX application inside
> the guest.
> 
> To fix this, send a SIGBUS to host userspace (like QEMU) which can
> follow up by injecting a #MC to the guest.
> However, when a page triggers a machine check, it only reports the
> PFN.  But in order to inject #MC into hypervisor, the virtual address
> is required.  The 'encl_owner' field is useless in virtualization
> case, then repurpose it as 'vepc_vaddr' - the virtual address of the
> virtual EPC page for such case so that arch_memory_failure() can easily
> retrieve it.
> 
> Suppose an enclave is shared by multiple processes, when an enclave
> page triggers a machine check, the enclave will be disabled so that
> it couldn't be entered again.  Killing other processes with the same
> enclave mapped would perhaps be overkill, but they are going to find
> that the enclave is "dead" next time they try to use it.  Thanks for
> Jarkko’s head up and Tony’s clarification on this point.
> Unlike host enclaves, virtual EPC instance cannot be shared by multiple
> VMs. It is because how enclaves are created is totally up to the guest.
> Sharing virtual EPC instance will be very likely to unexpectedly break
> enclaves in all VMs.
> 
> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork(). However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.
> 
> This series is based on tip/x86/sgx.
> 
> Tests:
> 1. MCE injection test for SGX in VM.
>    As we expected, the application was killed and VM was alive.
> 2. Kernel selftest/sgx: PASS
> 3. Internal SGX stress test: PASS
> 4. kmemleak test: No memory leakage detected.
> 
> Much appreciate your feedback.
> 
> Best Regards,
> Zhiquan
> 
> Zhiquan Li (3):
>   x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner
>   x86/sgx: Introduce union with vepc_vaddr field for virtualization case
>   x86/sgx: Fine grained SGX MCA behavior for virtualization
> 
>  arch/x86/kernel/cpu/sgx/main.c | 48 +++++++++++++++++++++++++---------
>  arch/x86/kernel/cpu/sgx/sgx.h  |  8 +++++-
>  arch/x86/kernel/cpu/sgx/virt.c |  4 ++-
>  3 files changed, 46 insertions(+), 14 deletions(-)
> 
Hello Boris,

Could you please help to take a look at this common feature which is
asked from CSP customer who has deployed SGX instance already?

The patchset had gotten ACK/Review-by from Jarkko, and Kai, and the
design has been stable for quite a while with through validation running
background.

Just wondering if you can shepherd the patchset for sgx tip tree for
v6.1 merge window?

Much appreciated.

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-09-20  6:39 ` [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case Zhiquan Li
@ 2022-10-10 23:10   ` Dave Hansen
  2022-10-11  5:49     ` Zhiquan Li
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-10 23:10 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp,
	kai.huang
  Cc: seanjc, fan.du, cathy.zhang

On 9/19/22 23:39, Zhiquan Li wrote:
> --- a/arch/x86/kernel/cpu/sgx/virt.c
> +++ b/arch/x86/kernel/cpu/sgx/virt.c
> @@ -46,10 +46,12 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
>  	if (epc_page)
>  		return 0;
>  
> -	epc_page = sgx_alloc_epc_page(vepc, false);
> +	epc_page = sgx_alloc_epc_page((void *)addr, false);
>  	if (IS_ERR(epc_page))
>  		return PTR_ERR(epc_page);

One thing not clear from the changelog: This actually changes the value
getting passed into sgx_alloc_epc_page() and set in the page->owner field.

What effect does this have?  If I apply these and run the tree at this
commit, what happens?  What behavior changes?

Was this 'vepc' value simply not used before?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-09-20  6:39 ` [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization Zhiquan Li
@ 2022-10-10 23:20   ` Dave Hansen
  2022-10-11  4:44     ` Zhiquan Li
  2022-10-11 14:04   ` Dave Hansen
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-10 23:20 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp,
	kai.huang
  Cc: seanjc, fan.du, cathy.zhang

On 9/19/22 23:39, Zhiquan Li wrote:
> +	if (flags & MF_ACTION_REQUIRED) {
> +		/*
> +		 * Provide extra info to the task so that it can make further
> +		 * decision but not simply kill it. This is quite useful for
> +		 * virtualization case.
> +		 */

"Quite useful to the virtualization case", eh?

This code can only even be *run* in the virtualization case.

> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
> +			/*
> +			 * The 'encl_owner' field is repurposed, when allocating EPC
> +			 * page it was assigned to the virtual address of virtual EPC
> +			 * page.
> +			 */

That comment is talking about history, which is kinda silly.  Let's just
focus on what the code is doing today.

> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);

First, why even align this?  What does it matter?

Second, is there a reason not to use PTR_ALIGN() here?

> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
> +			if (ret < 0)
> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
> +					current->comm, current->pid, ret);
> +		} else
> +			force_sig(SIGBUS);
> +	}


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-10 23:20   ` Dave Hansen
@ 2022-10-11  4:44     ` Zhiquan Li
  0 siblings, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-10-11  4:44 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, tglx, bp


On 2022/10/11 07:20, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> +	if (flags & MF_ACTION_REQUIRED) {
>> +		/*
>> +		 * Provide extra info to the task so that it can make further
>> +		 * decision but not simply kill it. This is quite useful for
>> +		 * virtualization case.
>> +		 */
> 
> "Quite useful to the virtualization case", eh?
> 
> This code can only even be *run* in the virtualization case.
> 

OK, I'll change it in V10.

>> +		if (page->flags & SGX_EPC_PAGE_KVM_GUEST) {
>> +			/*
>> +			 * The 'encl_owner' field is repurposed, when allocating EPC
>> +			 * page it was assigned to the virtual address of virtual EPC
>> +			 * page.
>> +			 */
> 
> That comment is talking about history, which is kinda silly.  Let's just
> focus on what the code is doing today.
> 

How about move the comment above to here? Because it's talking about
what the code is doing today.
Then this comment can be removed.

>> +			vaddr = (void *)((unsigned long)page->vepc_vaddr & PAGE_MASK);
> 
> First, why even align this?  What does it matter?
> 
> Second, is there a reason not to use PTR_ALIGN() here?
> 

You're right.  The align is unnecessary, host kernel just reports the
error address, no matter if it's aligned, guest kernel MCA will convert
it to PFN.

Best Regards,
Zhiquan

>> +			ret = force_sig_mceerr(BUS_MCEERR_AR, vaddr, PAGE_SHIFT);
>> +			if (ret < 0)
>> +				pr_err("Memory failure: Error sending signal to %s:%d: %d\n",
>> +					current->comm, current->pid, ret);
>> +		} else
>> +			force_sig(SIGBUS);
>> +	}
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-10-10 23:10   ` Dave Hansen
@ 2022-10-11  5:49     ` Zhiquan Li
  2022-10-11 13:57       ` Dave Hansen
  0 siblings, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-10-11  5:49 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, tglx, bp



On 2022/10/11 07:10, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> --- a/arch/x86/kernel/cpu/sgx/virt.c
>> +++ b/arch/x86/kernel/cpu/sgx/virt.c
>> @@ -46,10 +46,12 @@ static int __sgx_vepc_fault(struct sgx_vepc *vepc,
>>  	if (epc_page)
>>  		return 0;
>>  
>> -	epc_page = sgx_alloc_epc_page(vepc, false);
>> +	epc_page = sgx_alloc_epc_page((void *)addr, false);
>>  	if (IS_ERR(epc_page))
>>  		return PTR_ERR(epc_page);
> 
> One thing not clear from the changelog: This actually changes the value
> getting passed into sgx_alloc_epc_page() and set in the page->owner field.
> 
> What effect does this have?  If I apply these and run the tree at this
> commit, what happens?  What behavior changes?
> 
> Was this 'vepc' value simply not used before?

Yes, it was not used before. Kai had confirmed this point:
https://lore.kernel.org/all/fa93057f417b1f630d8199381589c415a0ec710b.camel@intel.com/

The initial idea is to add a new struct sgx_vepc_page to hold 'vaddr'
and the reversed relationship from EPC page to struct sgx_vepc:

  struct sgx_vepc_page {
  	unsigned long vaddr;
  	struct sgx_vepc *vepc;
  };

But which means there will be additional 16 bytes memory consumption on
host for each EPC page assigned into guest.

Jarkko suggested us repurpose the 'owner' field:
https://lore.kernel.org/linux-sgx/Yoa90l89OTQX0NYk@iki.fi/

1. It can save memory.
2. We don't have any scenario need the reversed relationship from EPC
page to struct sgx_vepc, keeping the relationship from VEPC to EPC pages
is enough.

May I add below description to explain behavior changes in changelog?

The behavior is changed when allocating an EPC page to a virtual EPC,
the virtual address of the virtual EPC will be passed as the first
argument of sgx_alloc_epc_page() which be assigned to 'encl_owner' field
of struct sgx_epc_page.  After that, the reversed relationship from an
EPC page to virtual EPC doesn't exist, in practice, such relationship is
useless.

Thanks & Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-10-11  5:49     ` Zhiquan Li
@ 2022-10-11 13:57       ` Dave Hansen
  2022-10-12  4:42         ` Zhiquan Li
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-11 13:57 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, tglx, bp

On 10/10/22 22:49, Zhiquan Li wrote:
> 
> May I add below description to explain behavior changes in changelog?
> 
> The behavior is changed when allocating an EPC page to a virtual EPC,
> the virtual address of the virtual EPC will be passed as the first
> argument of sgx_alloc_epc_page() which be assigned to 'encl_owner' field
> of struct sgx_epc_page.  After that, the reversed relationship from an
> EPC page to virtual EPC doesn't exist, in practice, such relationship is
> useless.

That honestly doesn't help me understand what's going on here.

The changelog and subject are spending far too much time explaining the
low level details of what is being done (like "add a union") and far too
little time explaining what it _means_.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-09-20  6:39 ` [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization Zhiquan Li
  2022-10-10 23:20   ` Dave Hansen
@ 2022-10-11 14:04   ` Dave Hansen
  2022-10-12  5:09     ` Zhiquan Li
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-11 14:04 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, tglx, bp,
	kai.huang
  Cc: seanjc, fan.du, cathy.zhang

On 9/19/22 23:39, Zhiquan Li wrote:
> Today, if a guest accesses an SGX EPC page with memory failure,
> the kernel behavior will kill the entire guest.  This blast
> radius is too large.  It would be idea to kill only the SGX

				ideal ^

> application inside the guest.
> 
> To fix this, send a SIGBUS to host userspace (like QEMU) which can
> follow up by injecting a #MC to the guest.

This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
So, whatever is making this better, it's not "send a SIGBUS" that's
doing it.

What does this patch actually do to reduce the blast radius?

> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> being shared by multiple VMs via fork().  However KVM doesn't support
> running a VM across multiple mm structures, and the de facto userspace
> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> this should not happen.

This is out of the blue.  Why is this here?

What happens if a hypervisor *DOES* fork()?  What's the fallout?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-10-11 13:57       ` Dave Hansen
@ 2022-10-12  4:42         ` Zhiquan Li
  2022-10-12 11:17           ` Huang, Kai
  0 siblings, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-10-12  4:42 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, tglx, bp, tony.luck


On 2022/10/11 21:57, Dave Hansen wrote:
> That honestly doesn't help me understand what's going on here.
> 
> The changelog and subject are spending far too much time explaining the
> low level details of what is being done (like "add a union") and far too
> little time explaining what it _means_.

Thanks for your comments, Dave.

I added more descriptions for motivation and reduced explanations
for details in changelog as below, could you help me review if it is
easier to understand now?

    When a page triggers a machine check, it only reports the PFN.  But
    in order to inject #MC into VM through hypervisor, the virtual
    address of virtual EPC page is required.  To avoid introduce a new
    dedicated structure to track it, repurpose the 'encl_owner' field as
    it is useless in virtualization case, so that arch_memory_failure()
    can retrieve the virtual address easily and attach it to the extra
    info of SIGBUS which will be sent to hypervisor.

    The union can prevent playing the casting games while using it.  And
    add a new EPC page flag - SGX_EPC_PAGE_KVM_GUEST to interpret the
    meaning of the field.

    The changes take place while allocating an EPC page to a KVM guests.
    After applying it, the reversed relationship from an EPC page to
    virtual EPC doesn't exist, in practice, such relationship is
    useless. Beyond that, other behaviors have not been changed.

Furthermore, may I change the subject as below to highlight the
motivation?

    x86/sgx: track virtual address of virtual EPC page for injecting #MC
    into VM

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-11 14:04   ` Dave Hansen
@ 2022-10-12  5:09     ` Zhiquan Li
  2022-10-12 11:01       ` Huang, Kai
  2022-10-12 14:36       ` Dave Hansen
  0 siblings, 2 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-10-12  5:09 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx


On 2022/10/11 22:04, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> Today, if a guest accesses an SGX EPC page with memory failure,
>> the kernel behavior will kill the entire guest.  This blast
>> radius is too large.  It would be idea to kill only the SGX
> 
> 				ideal ^
> 
>> application inside the guest.
>>
>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>> follow up by injecting a #MC to the guest.
> 
> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
> So, whatever is making this better, it's not "send a SIGBUS" that's
> doing it.
> 
> What does this patch actually do to reduce the blast radius?
> 

Thanks for your attention, Dave.

This part comes from your comments previously:

https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d

The key is the SIGBUS should with code BUS_MCEERR_AR and virtual address
of virtual EPC page. Hypervisor can identify it with the specific code
and inject #MC to the guest.

Can we change the statement like this?

    To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
    address of virtual EPC page to host userspace (like QEMU) which can
    follow up by injecting a #MC to the guest.

>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>> being shared by multiple VMs via fork().  However KVM doesn't support
>> running a VM across multiple mm structures, and the de facto userspace
>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>> this should not happen.
> 
> This is out of the blue.  Why is this here?
> 
> What happens if a hypervisor *DOES* fork()?  What's the fallout?

This part originates from below discussion:

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

It intents to answer the question:

    Do you think the processes sharing the same enclave need to be
    killed, even they had not touched the EPC page with hardware error?

Dave, do you mean it's not appropriate to be put here?

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12  5:09     ` Zhiquan Li
@ 2022-10-12 11:01       ` Huang, Kai
  2022-10-12 11:54         ` jarkko
  2022-10-13  2:05         ` Zhiquan Li
  2022-10-12 14:36       ` Dave Hansen
  1 sibling, 2 replies; 46+ messages in thread
From: Huang, Kai @ 2022-10-12 11:01 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
> > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> > > being shared by multiple VMs via fork().  However KVM doesn't support
> > > running a VM across multiple mm structures, and the de facto userspace
> > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> > > this should not happen.
> > 
> > This is out of the blue.  Why is this here?
> > 
> > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> 
> This part originates from below discussion:
> 
> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> It intents to answer the question:
> 
>     Do you think the processes sharing the same enclave need to be
>     killed, even they had not touched the EPC page with hardware error?

Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
see strong reason to enforce in the kernel.  For instance, multiple VMs can map
the same file as memory backend with MAP_SHARED, in which case they can all
break.  Userspace should use virtual EPC in the right way.

But the point is above is not directly related to your patch.  On host where
multiple processes can share one enclave legally, it does the same thing.  I
think you can just remove that paragraph from changelog.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case
  2022-10-12  4:42         ` Zhiquan Li
@ 2022-10-12 11:17           ` Huang, Kai
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Kai @ 2022-10-12 11:17 UTC (permalink / raw)
  To: linux-sgx, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Luck, Tony, Christopherson,, Sean, Zhang, Cathy, bp

On Wed, 2022-10-12 at 12:42 +0800, Zhiquan Li wrote:
> I added more descriptions for motivation and reduced explanations
> for details in changelog as below, could you help me review if it is
> easier to understand now?
> 
>     When a page triggers a machine check, it only reports the PFN.  But
>     in order to inject #MC into VM through hypervisor, the virtual
>     address of virtual EPC page is required.  To avoid introduce a new
>     dedicated structure to track it, repurpose the 'encl_owner' field as
>     it is useless in virtualization case, so that arch_memory_failure()
>     can retrieve the virtual address easily and attach it to the extra
>     info of SIGBUS which will be sent to hypervisor.
> 
>     The union can prevent playing the casting games while using it.  And
>     add a new EPC page flag - SGX_EPC_PAGE_KVM_GUEST to interpret the
>     meaning of the field.
> 
>     The changes take place while allocating an EPC page to a KVM guests.
>     After applying it, the reversed relationship from an EPC page to
>     virtual EPC doesn't exist, in practice, such relationship is
>     useless. Beyond that, other behaviors have not been changed.

To me the above last paragraph is very hard to comprehend.  I would just point
out the fact:

Virtual EPC pages's 'encl_owner' is not used so it's safe to repurpose it to
carry virtual EPC page's virtual address. 

(and perhaps somehow to merge to your first paragraph where why the virtual
address is needed is explained.)

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12 11:01       ` Huang, Kai
@ 2022-10-12 11:54         ` jarkko
  2022-10-12 20:56           ` Huang, Kai
  2022-10-13  2:05         ` Zhiquan Li
  1 sibling, 1 reply; 46+ messages in thread
From: jarkko @ 2022-10-12 11:54 UTC (permalink / raw)
  To: Huang, Kai
  Cc: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp

On Wed, Oct 12, 2022 at 11:01:49AM +0000, Huang, Kai wrote:
> On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
> > > > SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
> > > > being shared by multiple VMs via fork().  However KVM doesn't support
> > > > running a VM across multiple mm structures, and the de facto userspace
> > > > hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
> > > > this should not happen.
> > > 
> > > This is out of the blue.  Why is this here?
> > > 
> > > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> > 
> > This part originates from below discussion:
> > 
> > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > 
> > It intents to answer the question:
> > 
> >     Do you think the processes sharing the same enclave need to be
> >     killed, even they had not touched the EPC page with hardware error?
> 
> Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
> VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
> see strong reason to enforce in the kernel.  For instance, multiple VMs can map
> the same file as memory backend with MAP_SHARED, in which case they can all
> break.  Userspace should use virtual EPC in the right way.

Broadly speaking, for most of the time, and for any topic, kernel should
not prevent anything, unless it can break kernel's internal state.

> But the point is above is not directly related to your patch.  On host where
> multiple processes can share one enclave legally, it does the same thing.  I
> think you can just remove that paragraph from changelog.
> 
> -- 
> Thanks,
> -Kai
> 
> 

BR, Jarkko

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12  5:09     ` Zhiquan Li
  2022-10-12 11:01       ` Huang, Kai
@ 2022-10-12 14:36       ` Dave Hansen
  2022-10-13 14:40         ` Zhiquan Li
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-12 14:36 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx

On 10/11/22 22:09, Zhiquan Li wrote:
>>> application inside the guest.
>>>
>>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>>> follow up by injecting a #MC to the guest.
>>
>> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
>> So, whatever is making this better, it's not "send a SIGBUS" that's
>> doing it.
>>
>> What does this patch actually do to reduce the blast radius?
> 
> Thanks for your attention, Dave.
> 
> This part comes from your comments previously:
> 
> https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@kernel.org/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d
> 
> The key is the SIGBUS should with code BUS_MCEERR_AR and virtual address
> of virtual EPC page. Hypervisor can identify it with the specific code
> and inject #MC to the guest.
> 
> Can we change the statement like this?
> 
>     To fix this, send a SIGBUS with code BUS_MCEERR_AR and virtual
>     address of virtual EPC page to host userspace (like QEMU) which can
>     follow up by injecting a #MC to the guest.

This is really just mechanically restating what the patch does.  It
doesn't help me understand how it achieves the goal.  I guess I'll just
go and write the changelog for you.  Here's what I was missing:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <add example here>, so it is very unlikely to be
	able to recover from the signal and the app will die.

	To fix this, have the SGX machine check code generate a SIGBUS
	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
	userspace about why the SIGBUS was generated and gives it a
	chance of being able to handle the signal.

	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <explain what QEMU actually does here>.

	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
	down from the whole QEMU process to a single page.

This patch doesn't *actually* reduce the blast radius.  It enables QEMU
to do that.

>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>> running a VM across multiple mm structures, and the de facto userspace
>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>> this should not happen.
>>
>> This is out of the blue.  Why is this here?
>>
>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
> 
> This part originates from below discussion:
> 
> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> It intents to answer the question:
> 
>     Do you think the processes sharing the same enclave need to be
>     killed, even they had not touched the EPC page with hardware error?
> 
> Dave, do you mean it's not appropriate to be put here?

It's actually a pretty important point, but it's still out of the blue.

You also didn't answer my question.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12 11:54         ` jarkko
@ 2022-10-12 20:56           ` Huang, Kai
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Kai @ 2022-10-12 20:56 UTC (permalink / raw)
  To: jarkko
  Cc: Hansen, Dave, linux-sgx, Christopherson,,
	Sean, dave.hansen, bp, Zhang, Cathy, tglx, Luck, Tony, Du, Fan,
	Li, Zhiquan1

On Wed, 2022-10-12 at 14:54 +0300, jarkko@kernel.org wrote:
> > Sharing virtual EPC instance will very likely unexpectedly break enclaves in
> > all
> > VMs.  Whether kernel should explicitly prevent is another topic. To me I
> > don't
> > see strong reason to enforce in the kernel.  For instance, multiple VMs can
> > map
> > the same file as memory backend with MAP_SHARED, in which case they can all
> > break.  Userspace should use virtual EPC in the right way.
> 
> Broadly speaking, for most of the time, and for any topic, kernel should
> not prevent anything, unless it can break kernel's internal state.

Good to know.  Thanks you.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12 11:01       ` Huang, Kai
  2022-10-12 11:54         ` jarkko
@ 2022-10-13  2:05         ` Zhiquan Li
  1 sibling, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-10-13  2:05 UTC (permalink / raw)
  To: Huang, Kai, linux-sgx, Luck, Tony, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp


On 2022/10/12 19:01, Huang, Kai wrote:
> On Wed, 2022-10-12 at 13:09 +0800, Zhiquan Li wrote:
>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>>
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>>     Do you think the processes sharing the same enclave need to be
>>     killed, even they had not touched the EPC page with hardware error?
> 
> Sharing virtual EPC instance will very likely unexpectedly break enclaves in all
> VMs.  Whether kernel should explicitly prevent is another topic. To me I don't
> see strong reason to enforce in the kernel.  For instance, multiple VMs can map
> the same file as memory backend with MAP_SHARED, in which case they can all
> break.  Userspace should use virtual EPC in the right way.
> 
> But the point is above is not directly related to your patch.  On host where
> multiple processes can share one enclave legally, it does the same thing.  I
> think you can just remove that paragraph from changelog.
> 

OK, I'll remove it since V10.
Thank you all the same, Kai.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-12 14:36       ` Dave Hansen
@ 2022-10-13 14:40         ` Zhiquan Li
  2022-10-13 15:39           ` Dave Hansen
  2022-10-13 15:44           ` Dave Hansen
  0 siblings, 2 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-10-13 14:40 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx


On 2022/10/12 22:36, Dave Hansen wrote:
> This is really just mechanically restating what the patch does.  It
> doesn't help me understand how it achieves the goal.  I guess I'll just
> go and write the changelog for you.  Here's what I was missing:
> 
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <add example here>, so it is very unlikely to be
> 	able to recover from the signal and the app will die.
> 
> 	To fix this, have the SGX machine check code generate a SIGBUS
> 	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
> 	userspace about why the SIGBUS was generated and gives it a
> 	chance of being able to handle the signal.
> 
> 	QEMU, for instance, has code to handle these BUS_MCEERR_AR
> 	signals today.  Without this patch, QEMU will just die in the
> 	face of a generic SIGBUS, and take the whole VM with it.  With
> 	this patch <explain what QEMU actually does here>.
> 
> 	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
> 	down from the whole QEMU process to a single page.
> 
> This patch doesn't *actually* reduce the blast radius.  It enables QEMU
> to do that.
> 

Perfect, Dave! I'm ashamed to have to get your hands dirty in the end.
I'd like to replace the angle brackets content as below, could you have
a look?

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <a recoverable machine check error>, so it is very
	unlikely to be able to recover from the signal and the app will
	die.
	...
	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <QEMU will be aware of the machine check error and
	retrieve the virtual address (HVA) from siginfo, then convert it
	to GPA and inject a #MC into VM. Guest kernel will handle this
	#MC, find and kill the task who had accessed the error memory.>
	...

>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>
>> It intents to answer the question:
>>
>>     Do you think the processes sharing the same enclave need to be
>>     killed, even they had not touched the EPC page with hardware error?
>>
>> Dave, do you mean it's not appropriate to be put here?
> It's actually a pretty important point, but it's still out of the blue.
> 
> You also didn't answer my question.

Oh, sorry, I focused on answering "Why is this here?" but forgot to
answer "What's the fallout?"

It's a very good question.

Looks like Kai had answered it before:

	For instance, an application can open /dev/sgx_vepc, mmap(), and
	fork().  Then if the child does mmap() against the fd opened by
	the parent, the child will share the same vepc with the parent.

	...

	Sharing virtual EPC instance will very likely unexpectedly break
	enclaves in all VMs.
	...

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t

Moreover, I checked the code in below scenario while child sharing the
virtual EPC instance with parent:
- child closes /dev/sgx_vepc earlier than parent.
- child re-mmap() /dev/sgx_vepc to the same memory region as parent.
- child mmap() /dev/sgx_vepc to different memory region.
- child accesses the memory region of mmap() inherited from parent.

It's just that the app messes itself up, the vepc instance is protected
very well.
Maybe there are other corner cases I've not considered?

Best Regards,
Zhiquan


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 14:40         ` Zhiquan Li
@ 2022-10-13 15:39           ` Dave Hansen
  2022-10-14  5:42             ` Zhiquan Li
  2022-10-13 15:44           ` Dave Hansen
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-13 15:39 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx

On 10/13/22 07:40, Zhiquan Li wrote:
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <a recoverable machine check error>,

I wanted an example here of a SIGBUS that can occur from sources *OTHER*
than a recoverable machine check.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 14:40         ` Zhiquan Li
  2022-10-13 15:39           ` Dave Hansen
@ 2022-10-13 15:44           ` Dave Hansen
  2022-10-13 21:49             ` Huang, Kai
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-13 15:44 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx

On 10/13/22 07:40, Zhiquan Li wrote:
>>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>>> This part originates from below discussion:
>>>
>>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
>>>
>>> It intents to answer the question:
>>>
>>>     Do you think the processes sharing the same enclave need to be
>>>     killed, even they had not touched the EPC page with hardware error?
>>>
>>> Dave, do you mean it's not appropriate to be put here?
>> It's actually a pretty important point, but it's still out of the blue.
>>
>> You also didn't answer my question.
> Oh, sorry, I focused on answering "Why is this here?" but forgot to
> answer "What's the fallout?"
> 
> It's a very good question.
> 
> Looks like Kai had answered it before:
> 
> 	For instance, an application can open /dev/sgx_vepc, mmap(), and
> 	fork().  Then if the child does mmap() against the fd opened by
> 	the parent, the child will share the same vepc with the parent.
> 
> 	...
> 
> 	Sharing virtual EPC instance will very likely unexpectedly break
> 	enclaves in all VMs.

How, though?  This basically says, "I think things will break."  I want
to know a few more specifics than that before we merge this.  There are
ABI implications.

> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> 
> Moreover, I checked the code in below scenario while child sharing the
> virtual EPC instance with parent:
> - child closes /dev/sgx_vepc earlier than parent.
> - child re-mmap() /dev/sgx_vepc to the same memory region as parent.
> - child mmap() /dev/sgx_vepc to different memory region.
> - child accesses the memory region of mmap() inherited from parent.
> 
> It's just that the app messes itself up, the vepc instance is protected
> very well.
> Maybe there are other corner cases I've not considered?

... and what happens when *THIS* patch is in play?  What if there is a
machine check in SGX memory?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 15:44           ` Dave Hansen
@ 2022-10-13 21:49             ` Huang, Kai
  2022-10-13 22:02               ` Dave Hansen
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-13 21:49 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Thu, 2022-10-13 at 08:44 -0700, Dave Hansen wrote:
> On 10/13/22 07:40, Zhiquan Li wrote:
> > > > > What happens if a hypervisor *DOES* fork()?  What's the fallout?
> > > > This part originates from below discussion:
> > > > 
> > > > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > > > 
> > > > It intents to answer the question:
> > > > 
> > > >     Do you think the processes sharing the same enclave need to be
> > > >     killed, even they had not touched the EPC page with hardware error?
> > > > 
> > > > Dave, do you mean it's not appropriate to be put here?
> > > It's actually a pretty important point, but it's still out of the blue.
> > > 
> > > You also didn't answer my question.
> > Oh, sorry, I focused on answering "Why is this here?" but forgot to
> > answer "What's the fallout?"
> > 
> > It's a very good question.
> > 
> > Looks like Kai had answered it before:
> > 
> > 	For instance, an application can open /dev/sgx_vepc, mmap(), and
> > 	fork().  Then if the child does mmap() against the fd opened by
> > 	the parent, the child will share the same vepc with the parent.
> > 
> > 	...
> > 
> > 	Sharing virtual EPC instance will very likely unexpectedly break
> > 	enclaves in all VMs.
> 
> How, though?  This basically says, "I think things will break."  I want
> to know a few more specifics than that before we merge this.  There are
> ABI implications.

This is because virtual EPC is just a raw resource to guest, and how guest uses
virtual EPC to run enclaves is completely controlled by the guest.  When virtual
EPC is shared among VMs, it can happen that one guest _thinks_ one EPC page is
still free but in fact it has been already used by another VM as a valid enclave
page.  Also, one VM can just unconditionally sanitize all EPC pages before using
any of them (just like Linux does).  All of those can cause unexpected SGX
error, which can lead to failing to create enclave, and/or breaking existing
enclaves running in all VMs.

> 
> > https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@intel.com/#t
> > 
> > Moreover, I checked the code in below scenario while child sharing the
> > virtual EPC instance with parent:
> > - child closes /dev/sgx_vepc earlier than parent.
> > - child re-mmap() /dev/sgx_vepc to the same memory region as parent.
> > - child mmap() /dev/sgx_vepc to different memory region.
> > - child accesses the memory region of mmap() inherited from parent.
> > 
> > It's just that the app messes itself up, the vepc instance is protected
> > very well.
> > Maybe there are other corner cases I've not considered?
> 
> ... and what happens when *THIS* patch is in play?  What if there is a
> machine check in SGX memory?

With this patch, when #MC happens on one virtual EPC page, it will be send to
the VM, and the behaviour inside a VM depends on guest's implementation.  But
anyway based on Tony's reply, the enclave will be marked as "bad" by hardware
and it will eventually be killed:

https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/

If the virtual EPC is shared by other VMs, the worst case is when other VMs use
this bad EPC page (as we cannot take the bad EPC page out of VM for now), some
SGX error (ENCLS/ENCLU error) or #MC could happen again.  But this doesn't make
things worse, as when sharing virtual EPC page among VMs you are likely to break
enclaves in VMs anyway (as mentioned above).


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 21:49             ` Huang, Kai
@ 2022-10-13 22:02               ` Dave Hansen
  2022-10-13 22:15                 ` Huang, Kai
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-13 22:02 UTC (permalink / raw)
  To: Huang, Kai, linux-sgx, Luck, Tony, Li, Zhiquan1, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On 10/13/22 14:49, Huang, Kai wrote:
>> ... and what happens when *THIS* patch is in play?  What if there is a
>> machine check in SGX memory?
> With this patch, when #MC happens on one virtual EPC page, it will be send to
> the VM, and the behaviour inside a VM depends on guest's implementation.  But
> anyway based on Tony's reply, the enclave will be marked as "bad" by hardware
> and it will eventually be killed:
> 
> https://lore.kernel.org/linux-sgx/55ffd9475f5d46f68dd06c4323bec871@intel.com/
> https://lore.kernel.org/linux-sgx/5b6ad3e2af614caf9b41092797ffcd86@intel.com/
> 
> If the virtual EPC is shared by other VMs, the worst case is when other VMs use
> this bad EPC page (as we cannot take the bad EPC page out of VM for now), some
> SGX error (ENCLS/ENCLU error) or #MC could happen again.  But this doesn't make
> things worse, as when sharing virtual EPC page among VMs you are likely to break
> enclaves in VMs anyway (as mentioned above).

Guys, maybe I'm just not being specific enough.  But, Kai, you're off on
a tangent here that I didn't intend to ask about.  You're actually
adding confusion and delay here, not helping answer the question.

This was specifically a question about fork() plus VEPC plus machine
checks.  I'm still trying to get to the bottom of:

	What happens if a hypervisor *DOES* fork()?  What's the fallout?

Specifically, with this machine check SIGBUS implementation, one EPC
page can only have on host virtual address.  But, it's possible to
mmap() a single VEPC page in multiple processes at multiple host virtual
addresses.

So, the implementation only allows one host virtual address.  But, we
can imagine a scenario where there are *TWO* valid virtual addresses
where the VEPC is mapped.

What might happen in this case?  What's the fallout?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 22:02               ` Dave Hansen
@ 2022-10-13 22:15                 ` Huang, Kai
  2022-10-13 22:28                   ` Dave Hansen
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-13 22:15 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> Specifically, with this machine check SIGBUS implementation, one EPC
> page can only have on host virtual address.  But, it's possible to
> mmap() a single VEPC page in multiple processes at multiple host virtual
> addresses.
> 
> So, the implementation only allows one host virtual address.  But, we
> can imagine a scenario where there are *TWO* valid virtual addresses
> where the VEPC is mapped.
> 
> What might happen in this case?  What's the fallout?

My understanding is there can be two #MC happening on two virtual addresses.

Each #MC will be injected to the VM which triggers that #MC to handle.

Today, both VM gets killed.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 22:15                 ` Huang, Kai
@ 2022-10-13 22:28                   ` Dave Hansen
  2022-10-13 23:40                     ` Huang, Kai
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-13 22:28 UTC (permalink / raw)
  To: Huang, Kai, linux-sgx, Luck, Tony, Li, Zhiquan1, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On 10/13/22 15:15, Huang, Kai wrote:
> On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
>> Specifically, with this machine check SIGBUS implementation, one EPC
>> page can only have on host virtual address.  But, it's possible to
>> mmap() a single VEPC page in multiple processes at multiple host virtual
>> addresses.
>>
>> So, the implementation only allows one host virtual address.  But, we
>> can imagine a scenario where there are *TWO* valid virtual addresses
>> where the VEPC is mapped.
>>
>> What might happen in this case?  What's the fallout?
> 
> My understanding is there can be two #MC happening on two virtual addresses.
> 
> Each #MC will be injected to the VM which triggers that #MC to handle.

OK but which address with each of them see in siginfo->si_addr?  There's
only one ->vepc_vaddr.

It seems to me like this might result in the siginfo getting populated
with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
 Basically, whoever allocates the page sets up *their* virtual address.
 Anyone else who gets a machine check on that page will see a virtual
address that has nothing to do with its virtual address space.

Is that problematic?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 22:28                   ` Dave Hansen
@ 2022-10-13 23:40                     ` Huang, Kai
  2022-10-13 23:57                       ` Dave Hansen
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-13 23:40 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Thu, 2022-10-13 at 15:28 -0700, Dave Hansen wrote:
> On 10/13/22 15:15, Huang, Kai wrote:
> > On Thu, 2022-10-13 at 15:02 -0700, Hansen, Dave wrote:
> > > Specifically, with this machine check SIGBUS implementation, one EPC
> > > page can only have on host virtual address.  But, it's possible to
> > > mmap() a single VEPC page in multiple processes at multiple host virtual
> > > addresses.
> > > 
> > > So, the implementation only allows one host virtual address.  But, we
> > > can imagine a scenario where there are *TWO* valid virtual addresses
> > > where the VEPC is mapped.
> > > 
> > > What might happen in this case?  What's the fallout?
> > 
> > My understanding is there can be two #MC happening on two virtual addresses.
> > 
> > Each #MC will be injected to the VM which triggers that #MC to handle.
> 
> OK but which address with each of them see in siginfo->si_addr?  There's
> only one ->vepc_vaddr.
> 
> It seems to me like this might result in the siginfo getting populated
> with the ->si_addr from the *OTHER* process and *OTHER* virtual address.
>  Basically, whoever allocates the page sets up *their* virtual address.
>  Anyone else who gets a machine check on that page will see a virtual
> address that has nothing to do with its virtual address space.
> 
> Is that problematic?

Thanks Dave.  I see the problem now.  I already forgot the reason that I said
"whether to add a comment is good enough" in my original reply.

So with allowing fork(), putting the 'virtual address' to the owner doesn't
seems right.  I think perhaps we have two options:

1) to enforce in kernel that virtual EPC doesn't support fork(), and page's
owner can be used to carry 'virtual address'.
2) in arch_memory_failure, we find out the virtual address based on the current
process, but I am not entirely sure whether it can work, i.e. it's not called
from interrupt context?

What's your suggestion?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 23:40                     ` Huang, Kai
@ 2022-10-13 23:57                       ` Dave Hansen
  2022-10-14  0:19                         ` Huang, Kai
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-10-13 23:57 UTC (permalink / raw)
  To: Huang, Kai, linux-sgx, Luck, Tony, Li, Zhiquan1, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On 10/13/22 16:40, Huang, Kai wrote:
> What's your suggestion?

I'm fresh out of suggestions.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 23:57                       ` Dave Hansen
@ 2022-10-14  0:19                         ` Huang, Kai
  2022-10-19 10:59                           ` Huang, Kai
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-14  0:19 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote:
> On 10/13/22 16:40, Huang, Kai wrote:
> > What's your suggestion?
> 
> I'm fresh out of suggestions.

Thanks Dave.  I'll dig more and find out a solution.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-14  5:42             ` Zhiquan Li
@ 2022-10-14  5:41               ` Dave Hansen
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Hansen @ 2022-10-14  5:41 UTC (permalink / raw)
  To: Zhiquan Li, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx

On 10/13/22 22:42, Zhiquan Li wrote:
> On 2022/10/13 23:39, Dave Hansen wrote:
>> I wanted an example here of a SIGBUS that can occur from sources *OTHER*
>> than a recoverable machine check.
> Oh, got it, it should like this:
> 
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from invalid address alignment, nonexistent physical
> 	address or object-specific hardware error, so it is very
> 	unlikely to be able to recover from the signal and the app will
> 	die.

That looks much better, thanks.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-13 15:39           ` Dave Hansen
@ 2022-10-14  5:42             ` Zhiquan Li
  2022-10-14  5:41               ` Dave Hansen
  0 siblings, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-10-14  5:42 UTC (permalink / raw)
  To: Dave Hansen, linux-sgx, tony.luck, jarkko, dave.hansen, kai.huang
  Cc: seanjc, fan.du, cathy.zhang, bp, tglx


On 2022/10/13 23:39, Dave Hansen wrote:
> I wanted an example here of a SIGBUS that can occur from sources *OTHER*
> than a recoverable machine check.

Oh, got it, it should like this:

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from invalid address alignment, nonexistent physical
	address or object-specific hardware error, so it is very
	unlikely to be able to recover from the signal and the app will
	die.

Thanks for the clarification, Dave.

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-14  0:19                         ` Huang, Kai
@ 2022-10-19 10:59                           ` Huang, Kai
  2022-10-23 20:39                             ` jarkko
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-19 10:59 UTC (permalink / raw)
  To: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, jarkko, dave.hansen
  Cc: tglx, Du, Fan, Christopherson,, Sean, Zhang, Cathy, bp

On Fri, 2022-10-14 at 00:19 +0000, Huang, Kai wrote:
> On Thu, 2022-10-13 at 16:57 -0700, Dave Hansen wrote:
> > On 10/13/22 16:40, Huang, Kai wrote:
> > > What's your suggestion?
> > 
> > I'm fresh out of suggestions.
> 
> Thanks Dave.  I'll dig more and find out a solution.
> 

Hi Dave,

Not related to this series, I think there's a bug in the current virtual EPC
driver page fault handler in case of fork() (sorry didn't notice it at the first
place):

static int __sgx_vepc_fault(struct sgx_vepc *vepc,
                            struct vm_area_struct *vma, unsigned long addr)
{
	......
        /* Calculate index of EPC page in virtual EPC's page_array */
        index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);

        epc_page = xa_load(&vepc->page_array, index);
        if (epc_page)
                return 0;

	...
}

As you can see if the EPC page has already been populated at a given index of
one virtual EPC instance, the current fault handler just assumes the mapping is
already there and returns success immediately.  This causes a bug when one
virtual EPC instance is shared by multi processes via fork(): if the EPC page at
one index is already populated by the parent process, when the child accesses
the same page using different virtual address, the fault handler just returns
success w/o actually setting up the mapping for the child, resulting in endless
page fault.

This needs to be fixed in no matter what way.

My thinking is we can just enforce in the kernel that one virtual EPC instance
can only be mmap()-ed by the process which opens /dev/sgx_vepc (a new virtual
EPC instance is created on each open).  KVM userspace should never use fork() to
share virtual EPC instance otherwise userspace is using it in the wrong way. 
Thus enforcing in the kernel won't break any KVM userspace.

Does this make sense?

If it is OK to you, I'll send out a patch to fix.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-19 10:59                           ` Huang, Kai
@ 2022-10-23 20:39                             ` jarkko
  2022-10-24  1:32                               ` Zhiquan Li
  2022-10-24 22:23                               ` Huang, Kai
  0 siblings, 2 replies; 46+ messages in thread
From: jarkko @ 2022-10-23 20:39 UTC (permalink / raw)
  To: Huang, Kai
  Cc: linux-sgx, Luck, Tony, Li, Zhiquan1, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp

On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> static int __sgx_vepc_fault(struct sgx_vepc *vepc,
>                             struct vm_area_struct *vma, unsigned long addr)
> {
> 	......
>         /* Calculate index of EPC page in virtual EPC's page_array */
>         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> 
>         epc_page = xa_load(&vepc->page_array, index);
>         if (epc_page)
>                 return 0;
> 
> 	...
> }
> 
> As you can see if the EPC page has already been populated at a given index of
> one virtual EPC instance, the current fault handler just assumes the mapping is
> already there and returns success immediately.  This causes a bug when one
> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> one index is already populated by the parent process, when the child accesses
> the same page using different virtual address, the fault handler just returns
> success w/o actually setting up the mapping for the child, resulting in endless
> page fault.
> 
> This needs to be fixed in no matter what way.

I think you mean that vm_insert_pfn() does not happen for child because
of early return? I did not understand the part about "different virtual
addresses", as it is the same mapping.

R, Jarkko

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-23 20:39                             ` jarkko
@ 2022-10-24  1:32                               ` Zhiquan Li
  2022-11-01  0:46                                 ` jarkko
  2022-10-24 22:23                               ` Huang, Kai
  1 sibling, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-10-24  1:32 UTC (permalink / raw)
  To: jarkko, Huang, Kai
  Cc: linux-sgx, Luck, Tony, Hansen, Dave, dave.hansen, tglx, Du, Fan,
	Christopherson,,
	Sean, Zhang, Cathy, bp



On 2022/10/24 04:39, jarkko@kernel.org wrote:
>> As you can see if the EPC page has already been populated at a given index of
>> one virtual EPC instance, the current fault handler just assumes the mapping is
>> already there and returns success immediately.  This causes a bug when one
>> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
>> one index is already populated by the parent process, when the child accesses
>> the same page using different virtual address, the fault handler just returns
>> success w/o actually setting up the mapping for the child, resulting in endless
>> page fault.
>>
>> This needs to be fixed in no matter what way.
> I think you mean that vm_insert_pfn() does not happen for child because
> of early return? I did not understand the part about "different virtual
> addresses", as it is the same mapping.
> 

If userspace do something like this, the child will get "different
virtual address":

  ... parent run enclave within VM ...
  if (fork() == 0) {
      int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
      printf("child: %d\n", caddr[0]);
  }


- "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc.
- mmap() will create a VMA in child with "different virtual addresses".
- "caddr[0]" will cause a page fault as it's a new mapping.

1. Then kernel will run into the code snippet referenced by Kai.
2. The early return 0 will result in sgx_vepc_fault() return
   "VM_FAULT_NOPAGE".
3. This return value will make the condition as true at
   function do_user_addr_fault()

   if (likely(!(fault & VM_FAULT_ERROR)))
       return;

4. Since this page fault has not been handled and "$RIP" is still the
   original value, it will result in the same page fault again.  Namely,
   it's an endless page fault.

But the problem is neither the early return in __sgx_vepc_fault() nor
the return of VM_FAULT_NOPAGE at sgx_vepc_fault().  The root cause has
been highlighted by Kai, one virtual EPC instance
can only be mmap()-ed by the process which opens /dev/sgx_vepc.

In fact, to share a virtual EPC instance in userspace doesn't make any
sense.  Even though it can be shared by child, the virtual EPC page
cannot be used by child correctly.

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-23 20:39                             ` jarkko
  2022-10-24  1:32                               ` Zhiquan Li
@ 2022-10-24 22:23                               ` Huang, Kai
  2022-11-01  0:53                                 ` jarkko
  1 sibling, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-10-24 22:23 UTC (permalink / raw)
  To: jarkko
  Cc: Hansen, Dave, linux-sgx, Christopherson,,
	Sean, dave.hansen, bp, Zhang, Cathy, tglx, Luck, Tony, Du, Fan,
	Li, Zhiquan1

On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote:
> On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> > static int __sgx_vepc_fault(struct sgx_vepc *vepc,
> >                             struct vm_area_struct *vma, unsigned long addr)
> > {
> > 	......
> >         /* Calculate index of EPC page in virtual EPC's page_array */
> >         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> > 
> >         epc_page = xa_load(&vepc->page_array, index);
> >         if (epc_page)
> >                 return 0;
> > 
> > 	...
> > }
> > 
> > As you can see if the EPC page has already been populated at a given index of
> > one virtual EPC instance, the current fault handler just assumes the mapping is
> > already there and returns success immediately.  This causes a bug when one
> > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > one index is already populated by the parent process, when the child accesses
> > the same page using different virtual address, the fault handler just returns
> > success w/o actually setting up the mapping for the child, resulting in endless
> > page fault.
> > 
> > This needs to be fixed in no matter what way.
> 
> I think you mean that vm_insert_pfn() does not happen for child because
> of early return?
> 

Yes exactly.  Sorry for not pointing out directly.

> I did not understand the part about "different virtual
> addresses", as it is the same mapping.

The child can use mmap() to get a new mapping.  Whether the virtual address is
different from the parent's doesn't matter actually.


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-24  1:32                               ` Zhiquan Li
@ 2022-11-01  0:46                                 ` jarkko
  2022-11-02  1:38                                   ` Zhiquan Li
  2022-11-04 10:17                                   ` Huang, Kai
  0 siblings, 2 replies; 46+ messages in thread
From: jarkko @ 2022-11-01  0:46 UTC (permalink / raw)
  To: Zhiquan Li
  Cc: Huang, Kai, linux-sgx, Luck, Tony, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp

On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote:
> 
> 
> On 2022/10/24 04:39, jarkko@kernel.org wrote:
> >> As you can see if the EPC page has already been populated at a given index of
> >> one virtual EPC instance, the current fault handler just assumes the mapping is
> >> already there and returns success immediately.  This causes a bug when one
> >> virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> >> one index is already populated by the parent process, when the child accesses
> >> the same page using different virtual address, the fault handler just returns
> >> success w/o actually setting up the mapping for the child, resulting in endless
> >> page fault.
> >>
> >> This needs to be fixed in no matter what way.
> > I think you mean that vm_insert_pfn() does not happen for child because
> > of early return? I did not understand the part about "different virtual
> > addresses", as it is the same mapping.
> > 
> 
> If userspace do something like this, the child will get "different
> virtual address":
> 
>   ... parent run enclave within VM ...
>   if (fork() == 0) {
>       int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
>       printf("child: %d\n", caddr[0]);
>   }
> 
> 
> - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc.
> - mmap() will create a VMA in child with "different virtual addresses".
> - "caddr[0]" will cause a page fault as it's a new mapping.
> 
> 1. Then kernel will run into the code snippet referenced by Kai.
> 2. The early return 0 will result in sgx_vepc_fault() return
>    "VM_FAULT_NOPAGE".
> 3. This return value will make the condition as true at
>    function do_user_addr_fault()
> 
>    if (likely(!(fault & VM_FAULT_ERROR)))
>        return;
> 
> 4. Since this page fault has not been handled and "$RIP" is still the
>    original value, it will result in the same page fault again.  Namely,
>    it's an endless page fault.
> 
> But the problem is neither the early return in __sgx_vepc_fault() nor
> the return of VM_FAULT_NOPAGE at sgx_vepc_fault().  The root cause has
> been highlighted by Kai, one virtual EPC instance
> can only be mmap()-ed by the process which opens /dev/sgx_vepc.
> 
> In fact, to share a virtual EPC instance in userspace doesn't make any
> sense.  Even though it can be shared by child, the virtual EPC page
> cannot be used by child correctly.

OK, makes sense, thanks for the explanation!

Why would we want to enforce for user space not to do this, even
if it does cause malfunctioning program?

BR, Jarkko

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-10-24 22:23                               ` Huang, Kai
@ 2022-11-01  0:53                                 ` jarkko
  0 siblings, 0 replies; 46+ messages in thread
From: jarkko @ 2022-11-01  0:53 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Hansen, Dave, linux-sgx, Christopherson,,
	Sean, dave.hansen, bp, Zhang, Cathy, tglx, Luck, Tony, Du, Fan,
	Li, Zhiquan1

On Mon, Oct 24, 2022 at 10:23:10PM +0000, Huang, Kai wrote:
> On Sun, 2022-10-23 at 23:39 +0300, jarkko@kernel.org wrote:
> > On Wed, Oct 19, 2022 at 10:59:20AM +0000, Huang, Kai wrote:
> > > static int __sgx_vepc_fault(struct sgx_vepc *vepc,
> > >                             struct vm_area_struct *vma, unsigned long addr)
> > > {
> > > 	......
> > >         /* Calculate index of EPC page in virtual EPC's page_array */
> > >         index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
> > > 
> > >         epc_page = xa_load(&vepc->page_array, index);
> > >         if (epc_page)
> > >                 return 0;
> > > 
> > > 	...
> > > }
> > > 
> > > As you can see if the EPC page has already been populated at a given index of
> > > one virtual EPC instance, the current fault handler just assumes the mapping is
> > > already there and returns success immediately.  This causes a bug when one
> > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > > one index is already populated by the parent process, when the child accesses
> > > the same page using different virtual address, the fault handler just returns
> > > success w/o actually setting up the mapping for the child, resulting in endless
> > > page fault.
> > > 
> > > This needs to be fixed in no matter what way.
> > 
> > I think you mean that vm_insert_pfn() does not happen for child because
> > of early return?
> > 
> 
> Yes exactly.  Sorry for not pointing out directly.

Np.

> 
> > I did not understand the part about "different virtual
> > addresses", as it is the same mapping.
> 
> The child can use mmap() to get a new mapping.  Whether the virtual address is
> different from the parent's doesn't matter actually.

Thanks for the response, I had one additional query responded to Zhiquan.

BR, Jarkko

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-01  0:46                                 ` jarkko
@ 2022-11-02  1:38                                   ` Zhiquan Li
  2022-11-07 11:36                                     ` jarkko
  2022-11-04 10:17                                   ` Huang, Kai
  1 sibling, 1 reply; 46+ messages in thread
From: Zhiquan Li @ 2022-11-02  1:38 UTC (permalink / raw)
  To: jarkko
  Cc: Huang, Kai, linux-sgx, Luck, Tony, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp


On 2022/11/1 08:46, jarkko@kernel.org wrote:
> Why would we want to enforce for user space not to do this, even
> if it does cause malfunctioning program?
> 

We want to resolve the problem at the source rather than just deal with
the symptom passively derived from it. For instance, we might be able to
return VM_FAULT_SIGBUS to kill the malicious application, but if the
malicious child touch the memory earlier than parent despite it cannot
use the virtual EPC page, then the parent will be victim.

Even thought it's not a security threaten, there is no practical
significance for sharing a virtual EPC instance. So we would like to
prevent it from the beginning.

Best Regards,
Zhiquan

> BR, Jarkko

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-01  0:46                                 ` jarkko
  2022-11-02  1:38                                   ` Zhiquan Li
@ 2022-11-04 10:17                                   ` Huang, Kai
  2022-11-04 16:26                                     ` Sean Christopherson
  1 sibling, 1 reply; 46+ messages in thread
From: Huang, Kai @ 2022-11-04 10:17 UTC (permalink / raw)
  To: Li, Zhiquan1, jarkko
  Cc: Hansen, Dave, linux-sgx, Christopherson,,
	Sean, dave.hansen, bp, Zhang, Cathy, tglx, Luck, Tony, Du, Fan

On Tue, 2022-11-01 at 02:46 +0200, jarkko@kernel.org wrote:
> On Mon, Oct 24, 2022 at 09:32:13AM +0800, Zhiquan Li wrote:
> > 
> > 
> > On 2022/10/24 04:39, jarkko@kernel.org wrote:
> > > > As you can see if the EPC page has already been populated at a given index of
> > > > one virtual EPC instance, the current fault handler just assumes the mapping is
> > > > already there and returns success immediately.  This causes a bug when one
> > > > virtual EPC instance is shared by multi processes via fork(): if the EPC page at
> > > > one index is already populated by the parent process, when the child accesses
> > > > the same page using different virtual address, the fault handler just returns
> > > > success w/o actually setting up the mapping for the child, resulting in endless
> > > > page fault.
> > > > 
> > > > This needs to be fixed in no matter what way.
> > > I think you mean that vm_insert_pfn() does not happen for child because
> > > of early return? I did not understand the part about "different virtual
> > > addresses", as it is the same mapping.
> > > 
> > 
> > If userspace do something like this, the child will get "different
> > virtual address":
> > 
> >   ... parent run enclave within VM ...
> >   if (fork() == 0) {
> >       int *caddr = mmap(NULL, 4096, PROT_READ, MAP_SHARED, vepc_fd, 0);
> >       printf("child: %d\n", caddr[0]);
> >   }
> > 
> > 
> > - "vepc_fd" is inherited from parent which had opened /dev/sgx_vepc.
> > - mmap() will create a VMA in child with "different virtual addresses".
> > - "caddr[0]" will cause a page fault as it's a new mapping.
> > 
> > 1. Then kernel will run into the code snippet referenced by Kai.
> > 2. The early return 0 will result in sgx_vepc_fault() return
> >    "VM_FAULT_NOPAGE".
> > 3. This return value will make the condition as true at
> >    function do_user_addr_fault()
> > 
> >    if (likely(!(fault & VM_FAULT_ERROR)))
> >        return;
> > 
> > 4. Since this page fault has not been handled and "$RIP" is still the
> >    original value, it will result in the same page fault again.  Namely,
> >    it's an endless page fault.
> > 
> > But the problem is neither the early return in __sgx_vepc_fault() nor
> > the return of VM_FAULT_NOPAGE at sgx_vepc_fault().  The root cause has
> > been highlighted by Kai, one virtual EPC instance
> > can only be mmap()-ed by the process which opens /dev/sgx_vepc.
> > 
> > In fact, to share a virtual EPC instance in userspace doesn't make any
> > sense.  Even though it can be shared by child, the virtual EPC page
> > cannot be used by child correctly.
> 
> OK, makes sense, thanks for the explanation!
> 
> Why would we want to enforce for user space not to do this, even
> if it does cause malfunctioning program?
> 
> BR, Jarkko

Hi Jarkko, Dave,

I've been re-thinking about this #MC handle on virtual EPC by stepping back to
the beginning, and I think we have more problems than this "whether kernel
should enforce child cannot mmap() virtual EPC".

First of all, if we want to use epc->owner to carry the userspace virtual
address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good
enough -- nothing prevents userspace to call mmap() several times to map the
same virtual EPC pages.  So additionally, we also need to "make kernel enforce
one virtual EPC can only be mmap()-ed once".

Secondly, I am thinking that the current arch_memory_failure() cannot really
handle #MC for virtual EPC page correctly.  The problem is, even we mark the
page as poisoned, and send signal to userspace to inject #MC to guest to handle,
the poisoned virtual EPC page is never unmapped from the guest and then freed.

This means a malicious guest can always try to use the poisoned EPC page again
after it receives #MC on some EPC page.  I am not entirely sure what kind
behaviour/attack can be done in such case, but it seems the right behaviour
should be the KVM to inject the #MC and unmap the poisoned EPC page from guest.
And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should
kill the guest.

Hi Sean,

If you ever see this, could you also comment?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-04 10:17                                   ` Huang, Kai
@ 2022-11-04 16:26                                     ` Sean Christopherson
  2022-11-04 16:34                                       ` Dave Hansen
  2022-11-07  8:54                                       ` Huang, Kai
  0 siblings, 2 replies; 46+ messages in thread
From: Sean Christopherson @ 2022-11-04 16:26 UTC (permalink / raw)
  To: Huang, Kai
  Cc: Li, Zhiquan1, jarkko, Hansen, Dave, linux-sgx, dave.hansen, bp,
	Zhang, Cathy, tglx, Luck, Tony, Du, Fan

On Fri, Nov 04, 2022, Huang, Kai wrote:
> > > In fact, to share a virtual EPC instance in userspace doesn't make any
> > > sense.  Even though it can be shared by child, the virtual EPC page
> > > cannot be used by child correctly.
> > 
> > OK, makes sense, thanks for the explanation!
> > 
> > Why would we want to enforce for user space not to do this, even
> > if it does cause malfunctioning program?
> > 
> > BR, Jarkko
> 
> Hi Jarkko, Dave,
> 
> I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> the beginning, and I think we have more problems than this "whether kernel
> should enforce child cannot mmap() virtual EPC".

IMO, virtual EPC should be restricted to a single mm_struct, which is what was
originally proposed many years ago[*].  I should have pushed back harder, but by
that point I had mostly stopped caring about SGX.

There is no use case for sharing a virtual EPC, and conceptually it just doesn't
make sense because all use would need to be mutually exclusive on a per-page basis
to keep the EPCM happy.

[*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com

> First of all, if we want to use epc->owner to carry the userspace virtual
> address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good
> enough -- nothing prevents userspace to call mmap() several times to map the
> same virtual EPC pages.  So additionally, we also need to "make kernel enforce
> one virtual EPC can only be mmap()-ed once".
> 
> Secondly, I am thinking that the current arch_memory_failure() cannot really
> handle #MC for virtual EPC page correctly.  The problem is, even we mark the
> page as poisoned, and send signal to userspace to inject #MC to guest to handle,
> the poisoned virtual EPC page is never unmapped from the guest and then freed.
> 
> This means a malicious guest can always try to use the poisoned EPC page again
> after it receives #MC on some EPC page.  I am not entirely sure what kind
> behaviour/attack can be done in such case, but it seems the right behaviour
> should be the KVM to inject the #MC and unmap the poisoned EPC page from guest.
> And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should
> kill the guest.

Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-04 16:26                                     ` Sean Christopherson
@ 2022-11-04 16:34                                       ` Dave Hansen
  2022-11-07  8:55                                         ` Huang, Kai
  2022-11-07  8:54                                       ` Huang, Kai
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Hansen @ 2022-11-04 16:34 UTC (permalink / raw)
  To: Sean Christopherson, Huang, Kai
  Cc: Li, Zhiquan1, jarkko, linux-sgx, dave.hansen, bp, Zhang, Cathy,
	tglx, Luck, Tony, Du, Fan

On 11/4/22 09:26, Sean Christopherson wrote:
>> I've been re-thinking about this #MC handle on virtual EPC by stepping back to
>> the beginning, and I think we have more problems than this "whether kernel
>> should enforce child cannot mmap() virtual EPC".
> IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> originally proposed many years ago[*].  I should have pushed back harder, but by
> that point I had mostly stopped caring about SGX.

Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't
be too hard to pull off.  We could just return -EBUSY if another mm
comes around and tries to mmap() the fd.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-04 16:26                                     ` Sean Christopherson
  2022-11-04 16:34                                       ` Dave Hansen
@ 2022-11-07  8:54                                       ` Huang, Kai
  1 sibling, 0 replies; 46+ messages in thread
From: Huang, Kai @ 2022-11-07  8:54 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: Hansen, Dave, linux-sgx, Zhang, Cathy, dave.hansen, jarkko, bp,
	tglx, Luck, Tony, Du, Fan, Li, Zhiquan1

On Fri, 2022-11-04 at 16:26 +0000, Sean Christopherson wrote:
> On Fri, Nov 04, 2022, Huang, Kai wrote:
> > > > In fact, to share a virtual EPC instance in userspace doesn't make any
> > > > sense.  Even though it can be shared by child, the virtual EPC page
> > > > cannot be used by child correctly.
> > > 
> > > OK, makes sense, thanks for the explanation!
> > > 
> > > Why would we want to enforce for user space not to do this, even
> > > if it does cause malfunctioning program?
> > > 
> > > BR, Jarkko
> > 
> > Hi Jarkko, Dave,
> > 
> > I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> > the beginning, and I think we have more problems than this "whether kernel
> > should enforce child cannot mmap() virtual EPC".
> 
> IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> originally proposed many years ago[*].  I should have pushed back harder, but by
> that point I had mostly stopped caring about SGX.
> 
> There is no use case for sharing a virtual EPC, and conceptually it just doesn't
> make sense because all use would need to be mutually exclusive on a per-page basis
> to keep the EPCM happy.
> 
> [*] https://lore.kernel.org/kvm/ace9d4cb10318370f6145aaced0cfa73dda36477.1609890536.git.kai.huang@intel.com

Thanks for your time.  Yes I agree.

Also, IMHO we can further enforce to only allow associating one VMA with one
virtual EPC instance.  It also doesn't make sense to have multi-mappings (VMAs)
to the same virtual EPC pages even for one mm_struct, right?  

This also makes zapping PTE easier (as mentioned below).

> 
> > First of all, if we want to use epc->owner to carry the userspace virtual
> > address, "make kernel enforce child cannot mmap() virtual EPC" alone isn't good
> > enough -- nothing prevents userspace to call mmap() several times to map the
> > same virtual EPC pages.  So additionally, we also need to "make kernel enforce
> > one virtual EPC can only be mmap()-ed once".
> > 
> > Secondly, I am thinking that the current arch_memory_failure() cannot really
> > handle #MC for virtual EPC page correctly.  The problem is, even we mark the
> > page as poisoned, and send signal to userspace to inject #MC to guest to handle,
> > the poisoned virtual EPC page is never unmapped from the guest and then freed.
> > 
> > This means a malicious guest can always try to use the poisoned EPC page again
> > after it receives #MC on some EPC page.  I am not entirely sure what kind
> > behaviour/attack can be done in such case, but it seems the right behaviour
> > should be the KVM to inject the #MC and unmap the poisoned EPC page from guest.
> > And if guest ever tries to use this "guest's EPC page" (GFN) again, KVM should
> > kill the guest.
> 
> Just zap the PTEs for the affected mm_struct, mmu_notifier => KVM will do the rest.

Zapping PTEs alone can, i.e. take the poisoned virtual EPC page away from the
guest and map a new one, but it doesn't inject #MC to the guest.  W/o injecting
#MC, the guest will just see sudden lose of enclave which consumes that EPC page
(enclave is marked as bad by hardware if any page caused #MC).

But I guess it's also acceptable since we already have other cases that guest
can possibly see sudden lose of enclave (i.e. live migration)?

Btw, currently the virtual EPC driver doesn't have infrastructure to zap PTEs. 
If we can associate one virtual EPC instance to one VMA, then zapping PTE can be
easily done by bookkeeping this VMA.

Does this look good?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-04 16:34                                       ` Dave Hansen
@ 2022-11-07  8:55                                         ` Huang, Kai
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Kai @ 2022-11-07  8:55 UTC (permalink / raw)
  To: Hansen, Dave, Christopherson,, Sean
  Cc: linux-sgx, Zhang, Cathy, dave.hansen, jarkko, bp, tglx, Luck,
	Tony, Du, Fan, Li, Zhiquan1

On Fri, 2022-11-04 at 09:34 -0700, Dave Hansen wrote:
> On 11/4/22 09:26, Sean Christopherson wrote:
> > > I've been re-thinking about this #MC handle on virtual EPC by stepping back to
> > > the beginning, and I think we have more problems than this "whether kernel
> > > should enforce child cannot mmap() virtual EPC".
> > IMO, virtual EPC should be restricted to a single mm_struct, which is what was
> > originally proposed many years ago[*].  I should have pushed back harder, but by
> > that point I had mostly stopped caring about SGX.
> 
> Considering that we have VM_DONTCOPY set on the vepc VMA, this shouldn't
> be too hard to pull off.  We could just return -EBUSY if another mm
> comes around and tries to mmap() the fd.

Yes.  We can record the MM which opens /dev/sgx_vepc and reject mmap() from
other MMs.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-02  1:38                                   ` Zhiquan Li
@ 2022-11-07 11:36                                     ` jarkko
  2022-11-07 12:19                                       ` Zhiquan Li
  0 siblings, 1 reply; 46+ messages in thread
From: jarkko @ 2022-11-07 11:36 UTC (permalink / raw)
  To: Zhiquan Li
  Cc: Huang, Kai, linux-sgx, Luck, Tony, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp

On Wed, Nov 02, 2022 at 09:38:55AM +0800, Zhiquan Li wrote:
> 
> On 2022/11/1 08:46, jarkko@kernel.org wrote:
> > Why would we want to enforce for user space not to do this, even
> > if it does cause malfunctioning program?
> > 
> 
> We want to resolve the problem at the source rather than just deal with
> the symptom passively derived from it. For instance, we might be able to
> return VM_FAULT_SIGBUS to kill the malicious application, but if the
> malicious child touch the memory earlier than parent despite it cannot
> use the virtual EPC page, then the parent will be victim.
> 
> Even thought it's not a security threaten, there is no practical
> significance for sharing a virtual EPC instance. So we would like to
> prevent it from the beginning.

Can you phrase this to the commit message? This makes sense as a
motivation.

BR, Jarkko


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization
  2022-11-07 11:36                                     ` jarkko
@ 2022-11-07 12:19                                       ` Zhiquan Li
  0 siblings, 0 replies; 46+ messages in thread
From: Zhiquan Li @ 2022-11-07 12:19 UTC (permalink / raw)
  To: jarkko
  Cc: Huang, Kai, linux-sgx, Luck, Tony, Hansen, Dave, dave.hansen,
	tglx, Du, Fan, Christopherson,,
	Sean, Zhang, Cathy, bp



On 2022/11/7 19:36, jarkko@kernel.org wrote:
> Can you phrase this to the commit message? This makes sense as a
> motivation.
> 

No problem, I will add it to v10 patch after current concerns are
clarified and get a clear solution.

Best Regards,
Zhiquan

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-11-07 12:11 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-20  6:39 [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
2022-09-20  6:39 ` [PATCH v9 1/3] x86/sgx: Rename the owner field of struct sgx_epc_page as encl_owner Zhiquan Li
2022-09-20  6:39 ` [PATCH v9 2/3] x86/sgx: Introduce union with vepc_vaddr field for virtualization case Zhiquan Li
2022-10-10 23:10   ` Dave Hansen
2022-10-11  5:49     ` Zhiquan Li
2022-10-11 13:57       ` Dave Hansen
2022-10-12  4:42         ` Zhiquan Li
2022-10-12 11:17           ` Huang, Kai
2022-09-20  6:39 ` [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization Zhiquan Li
2022-10-10 23:20   ` Dave Hansen
2022-10-11  4:44     ` Zhiquan Li
2022-10-11 14:04   ` Dave Hansen
2022-10-12  5:09     ` Zhiquan Li
2022-10-12 11:01       ` Huang, Kai
2022-10-12 11:54         ` jarkko
2022-10-12 20:56           ` Huang, Kai
2022-10-13  2:05         ` Zhiquan Li
2022-10-12 14:36       ` Dave Hansen
2022-10-13 14:40         ` Zhiquan Li
2022-10-13 15:39           ` Dave Hansen
2022-10-14  5:42             ` Zhiquan Li
2022-10-14  5:41               ` Dave Hansen
2022-10-13 15:44           ` Dave Hansen
2022-10-13 21:49             ` Huang, Kai
2022-10-13 22:02               ` Dave Hansen
2022-10-13 22:15                 ` Huang, Kai
2022-10-13 22:28                   ` Dave Hansen
2022-10-13 23:40                     ` Huang, Kai
2022-10-13 23:57                       ` Dave Hansen
2022-10-14  0:19                         ` Huang, Kai
2022-10-19 10:59                           ` Huang, Kai
2022-10-23 20:39                             ` jarkko
2022-10-24  1:32                               ` Zhiquan Li
2022-11-01  0:46                                 ` jarkko
2022-11-02  1:38                                   ` Zhiquan Li
2022-11-07 11:36                                     ` jarkko
2022-11-07 12:19                                       ` Zhiquan Li
2022-11-04 10:17                                   ` Huang, Kai
2022-11-04 16:26                                     ` Sean Christopherson
2022-11-04 16:34                                       ` Dave Hansen
2022-11-07  8:55                                         ` Huang, Kai
2022-11-07  8:54                                       ` Huang, Kai
2022-10-24 22:23                               ` Huang, Kai
2022-11-01  0:53                                 ` jarkko
2022-09-29  8:05 ` [PATCH v9 0/3] x86/sgx: fine grained SGX MCA behavior Zhiquan Li
2022-10-08  2:29 ` Zhiquan Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).