kvm.vger.kernel.org archive mirror
* [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
@ 2021-05-17 20:07 Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails Claudio Imbrenda
                   ` (11 more replies)
  0 siblings, 12 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Previously, when a protected VM was rebooted or shut down, its memory
was made unprotected, and then the protected VM itself was destroyed.
Looping over the whole address space can take some time, considering
the overhead of the various Ultravisor Calls (UVCs). This means that a
reboot or a shutdown could take a long time, depending on the amount
of memory in use.

This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
the background, allowing the guest to restart or terminate significantly
faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread has
gone through the whole address space, the protected VM is actually
destroyed.

For the shutdown case, a list of pages to be destroyed is formed when
the mm is torn down. Instead of just being unmapped when the address
space is torn down, the pages are also set aside. Later, when KVM
cleans up the VM, a thread is started to clean up the pages from the
list.

This means that the same address space can have memory belonging to
more than one protected guest; only one of them will be running, and
the others will not even have any CPUs.

Claudio Imbrenda (11):
  KVM: s390: pv: leak the ASCE page when destroy fails
  KVM: s390: pv: properly handle page flags for protected guests
  KVM: s390: pv: handle secure storage violations for protected guests
  KVM: s390: pv: handle secure storage exceptions for normal guests
  KVM: s390: pv: refactor s390_reset_acc
  KVM: s390: pv: usage counter instead of flag
  KVM: s390: pv: add export before import
  KVM: s390: pv: lazy destroy for reboot
  KVM: s390: pv: extend lazy destroy to handle shutdown
  KVM: s390: pv: module parameter to fence lazy destroy
  KVM: s390: pv: add support for UV feature bits

 arch/s390/boot/uv.c                 |   1 +
 arch/s390/include/asm/gmap.h        |   5 +-
 arch/s390/include/asm/mmu.h         |   3 +
 arch/s390/include/asm/mmu_context.h |   2 +
 arch/s390/include/asm/pgtable.h     |  16 +-
 arch/s390/include/asm/uv.h          |  35 ++++-
 arch/s390/kernel/uv.c               | 133 +++++++++++++++-
 arch/s390/kvm/kvm-s390.c            |   6 +-
 arch/s390/kvm/kvm-s390.h            |   2 +-
 arch/s390/kvm/pv.c                  | 230 ++++++++++++++++++++++++++--
 arch/s390/mm/fault.c                |  22 ++-
 arch/s390/mm/gmap.c                 |  86 +++++++----
 12 files changed, 490 insertions(+), 51 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-18 10:26   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 02/11] KVM: s390: pv: properly handle page flags for protected guests Claudio Imbrenda
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

When the destroy configuration UVC fails, the page pointed to by the
ASCE of the VM becomes poisoned and, to avoid issues, must not be used
again.

Since the page is then in practice unusable, we set it aside and leak it.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 813b6e93dc83..e0532ab725bf 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -150,6 +150,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
 	return -ENOMEM;
 }
 
+/*
+ * Remove the topmost level of page tables from the list of page tables of
+ * the gmap.
+ * This means that it will not be freed when the VM is torn down, and
+ * needs to be handled separately by the caller, unless the leak is
+ * intentional.
+ */
+static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
+{
+	struct page *old;
+
+	old = virt_to_page(kvm->arch.gmap->table);
+	list_del(&old->lru);
+	/* in case the ASCE needs to be "removed" multiple times */
+	INIT_LIST_HEAD(&old->lru);
+}
+
+/*
+ * Try to replace the current ASCE with another equivalent one.
+ * If the allocation of the new top level page table fails, the ASCE is not
+ * replaced.
+ * In any case, the old ASCE is removed from the list, therefore the caller
+ * has to make sure to save a pointer to it beforehand, unless the leak is
+ * intentional.
+ */
+static int kvm_s390_pv_replace_asce(struct kvm *kvm)
+{
+	unsigned long asce;
+	struct page *page;
+	void *table;
+
+	kvm_s390_pv_remove_old_asce(kvm);
+
+	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+	if (!page)
+		return -ENOMEM;
+	list_add(&page->lru, &kvm->arch.gmap->crst_list);
+
+	table = page_to_virt(page);
+	memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
+
+	asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
+	WRITE_ONCE(kvm->arch.gmap->asce, asce);
+	WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
+	WRITE_ONCE(kvm->arch.gmap->table, table);
+
+	return 0;
+}
+
 /* this should not fail, but if it does, we must not free the donated memory */
 int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 {
@@ -164,9 +213,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	atomic_set(&kvm->mm->context.is_protected, 0);
 	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
 	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
-	/* Inteded memory leak on "impossible" error */
+	/* Intended memory leak on "impossible" error */
 	if (!cc)
 		kvm_s390_pv_dealloc_vm(kvm);
+	else
+		kvm_s390_pv_replace_asce(kvm);
 	return cc ? -EIO : 0;
 }
 
-- 
2.31.1



* [PATCH v1 02/11] KVM: s390: pv: properly handle page flags for protected guests
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 03/11] KVM: s390: pv: handle secure storage violations " Claudio Imbrenda
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.

These new functions can only be called on pages for which a reference
is already held.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h |  9 ++++++---
 arch/s390/include/asm/uv.h      | 10 ++++++++--
 arch/s390/kernel/uv.c           | 34 ++++++++++++++++++++++++++++++++-
 arch/s390/mm/gmap.c             |  4 +++-
 4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 29c7ecd5ad1d..141a8aaacd6d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1062,8 +1062,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 	pte_t res;
 
 	res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+	/* At this point the reference through the mapping is still present */
 	if (mm_is_protected(mm) && pte_present(res))
-		uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+		uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
 	return res;
 }
 
@@ -1079,8 +1080,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
 	pte_t res;
 
 	res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+	/* At this point the reference through the mapping is still present */
 	if (mm_is_protected(vma->vm_mm) && pte_present(res))
-		uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+		uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
 	return res;
 }
 
@@ -1104,8 +1106,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 	} else {
 		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
 	}
+	/* At this point the reference through the mapping is still present */
 	if (mm_is_protected(mm) && pte_present(res))
-		uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+		uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
 	return res;
 }
 
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 7b98d4caee77..9aa621e84745 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -345,8 +345,9 @@ static inline int is_prot_virt_host(void)
 }
 
 int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
-int uv_destroy_page(unsigned long paddr);
+int uv_destroy_owned_page(unsigned long paddr);
 int uv_convert_from_secure(unsigned long paddr);
+int uv_convert_owned_from_secure(unsigned long paddr);
 int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
 
 void setup_uv(void);
@@ -356,7 +357,7 @@ void adjust_to_uv_max(unsigned long *vmax);
 static inline void setup_uv(void) {}
 static inline void adjust_to_uv_max(unsigned long *vmax) {}
 
-static inline int uv_destroy_page(unsigned long paddr)
+static inline int uv_destroy_owned_page(unsigned long paddr)
 {
 	return 0;
 }
@@ -365,6 +366,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
 {
 	return 0;
 }
+
+static inline int uv_convert_owned_from_secure(unsigned long paddr)
+{
+	return 0;
+}
 #endif
 
 #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index b2d2ad153067..3d94760c0371 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -121,7 +121,7 @@ static int uv_pin_shared(unsigned long paddr)
  *
  * @paddr: Absolute host address of page to be destroyed
  */
-int uv_destroy_page(unsigned long paddr)
+static int uv_destroy_page(unsigned long paddr)
 {
 	struct uv_cb_cfs uvcb = {
 		.header.cmd = UVC_CMD_DESTR_SEC_STOR,
@@ -141,6 +141,22 @@ int uv_destroy_page(unsigned long paddr)
 	return 0;
 }
 
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_destroy_owned_page(unsigned long paddr)
+{
+	struct page *page = phys_to_page(paddr);
+	int rc;
+
+	get_page(page);
+	rc = uv_destroy_page(paddr);
+	if (!rc)
+		clear_bit(PG_arch_1, &page->flags);
+	put_page(page);
+	return rc;
+}
+
 /*
  * Requests the Ultravisor to encrypt a guest page and make it
  * accessible to the host for paging (export).
@@ -160,6 +176,22 @@ int uv_convert_from_secure(unsigned long paddr)
 	return 0;
 }
 
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_convert_owned_from_secure(unsigned long paddr)
+{
+	struct page *page = phys_to_page(paddr);
+	int rc;
+
+	get_page(page);
+	rc = uv_convert_from_secure(paddr);
+	if (!rc)
+		clear_bit(PG_arch_1, &page->flags);
+	put_page(page);
+	return rc;
+}
+
 /*
  * Calculate the expected ref_count for a page that would otherwise have no
  * further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..de679facc720 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
 {
 	pte_t pte = READ_ONCE(*ptep);
 
+	/* There is a reference through the mapping */
 	if (pte_present(pte))
-		WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
+		WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
+
 	return 0;
 }
 
-- 
2.31.1



* [PATCH v1 03/11] KVM: s390: pv: handle secure storage violations for protected guests
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 02/11] KVM: s390: pv: properly handle page flags for protected guests Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 04/11] KVM: s390: pv: handle secure storage exceptions for normal guests Claudio Imbrenda
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation.

This patch adds handling of secure storage violations for protected
guests.

Pages that trigger the exception will be made non-secure before
attempting to use them again.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/include/asm/uv.h |  1 +
 arch/s390/kernel/uv.c      | 43 ++++++++++++++++++++++++++++++++++++++
 arch/s390/mm/fault.c       | 10 +++++++++
 3 files changed, 54 insertions(+)

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 9aa621e84745..c6fe6a42e79b 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -345,6 +345,7 @@ static inline int is_prot_virt_host(void)
 }
 
 int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
 int uv_destroy_owned_page(unsigned long paddr);
 int uv_convert_from_secure(unsigned long paddr);
 int uv_convert_owned_from_secure(unsigned long paddr);
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 3d94760c0371..b19b1a1444ec 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -338,6 +338,49 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
 }
 EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
 
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
+{
+	struct vm_area_struct *vma;
+	unsigned long uaddr;
+	struct page *page;
+	int rc;
+
+	rc = -EFAULT;
+	mmap_read_lock(gmap->mm);
+
+	uaddr = __gmap_translate(gmap, gaddr);
+	if (IS_ERR_VALUE(uaddr))
+		goto out;
+	vma = find_vma(gmap->mm, uaddr);
+	if (!vma)
+		goto out;
+	/*
+	 * Huge pages should not be able to become secure
+	 */
+	if (is_vm_hugetlb_page(vma))
+		goto out;
+
+	rc = 0;
+	/* we take an extra reference here */
+	page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
+	if (IS_ERR_OR_NULL(page))
+		goto out;
+	rc = uv_destroy_owned_page(page_to_phys(page));
+	/*
+	 * Fault handlers can race; it is possible that one CPU will destroy
+	 * and import the page, at which point the second CPU handling the
+	 * same fault will not be able to destroy. In that case we do not
+	 * want to terminate the process, we instead try to export the page.
+	 */
+	if (rc)
+		rc = uv_convert_owned_from_secure(page_to_phys(page));
+	put_page(page);
+out:
+	mmap_read_unlock(gmap->mm);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_destroy_page);
+
 /*
  * To be called with the page locked or with an extra reference! This will
  * prevent gmap_make_secure from touching the page concurrently. Having 2
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index e30c7c781172..efdef35bc415 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -846,6 +846,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);
 
 void do_secure_storage_violation(struct pt_regs *regs)
 {
+	unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
+	struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
+
+	/*
+	 * If the VM has been rebooted, its address space might still contain
+	 * secure pages from the previous boot.
+	 * Clear the page so it can be reused.
+	 */
+	if (!gmap_destroy_page(gmap, gaddr))
+		return;
 	/*
 	 * Either KVM messed up the secure guest mapping or the same
 	 * page is mapped into multiple secure guests.
-- 
2.31.1



* [PATCH v1 04/11] KVM: s390: pv: handle secure storage exceptions for normal guests
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (2 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 03/11] KVM: s390: pv: handle secure storage violations " Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc Claudio Imbrenda
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

With upcoming patches, normal guests might touch secure pages.

This patch extends the existing exception handler to convert the pages
to non-secure also when the exception is triggered by a normal guest.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/mm/fault.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index efdef35bc415..24a39457cd48 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -789,9 +789,20 @@ void do_secure_storage_access(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	struct page *page;
+	struct gmap *gmap;
 	int rc;
 
 	switch (get_fault_type(regs)) {
+	case GMAP_FAULT:
+		gmap = (struct gmap *)S390_lowcore.gmap;
+		/*
+		 * Very unlikely, but if it happens, simply try again.
+		 * The next attempt will trigger a different exception.
+		 */
+		addr = __gmap_translate(gmap, addr);
+		if (addr == -EFAULT)
+			break;
+		fallthrough;
 	case USER_FAULT:
 		mm = current->mm;
 		mmap_read_lock(mm);
@@ -820,7 +831,6 @@ void do_secure_storage_access(struct pt_regs *regs)
 		if (rc)
 			BUG();
 		break;
-	case GMAP_FAULT:
 	default:
 		do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
 		WARN_ON_ONCE(1);
-- 
2.31.1



* [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (3 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 04/11] KVM: s390: pv: handle secure storage exceptions for normal guests Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-26 12:11   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag Claudio Imbrenda
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Refactor s390_reset_acc so that its pieces can be reused in upcoming
patches, which will also need the users parameter of
s390_uv_destroy_range.

We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/include/asm/gmap.h |  5 +-
 arch/s390/kvm/pv.c           | 12 ++++-
 arch/s390/mm/gmap.c          | 88 ++++++++++++++++++++++++------------
 3 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 40264f60b0da..618ddc455867 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,5 +147,8 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
 void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
 			     unsigned long gaddr, unsigned long vmaddr);
 int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+			   unsigned long start, unsigned long end);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
+
 #endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index e0532ab725bf..c3f9f30d2ed4 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
 #include <asm/gmap.h>
 #include <asm/uv.h>
 #include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
 #include "kvm-s390.h"
 
 int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -204,8 +206,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 {
 	int cc;
 
-	/* make all pages accessible before destroying the guest */
-	s390_reset_acc(kvm->mm);
+	/*
+	 * if the mm still has a mapping, make all its pages accessible
+	 * before destroying the guest
+	 */
+	if (mmget_not_zero(kvm->mm)) {
+		s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
+		mmput(kvm->mm);
+	}
 
 	cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
 			   UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index de679facc720..ad210a6e2c41 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2670,41 +2670,71 @@ void s390_reset_cmma(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(s390_reset_cmma);
 
-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
-			    unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+	unsigned long next;
+	unsigned long count;
+	unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+			     unsigned long next, struct mm_walk *walk)
 {
+	struct reset_walk_state *p = walk->private;
 	pte_t pte = READ_ONCE(*ptep);
 
-	/* There is a reference through the mapping */
-	if (pte_present(pte))
-		WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
-	return 0;
+	if (pte_present(pte)) {
+		/* we have a reference from the mapping, take an extra one */
+		get_page(phys_to_page(pte_val(pte)));
+		p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+		p->next = next;
+		p->count++;
+	}
+	return p->count >= DESTROY_LOOP_THRESHOLD;
 }
 
-static const struct mm_walk_ops reset_acc_walk_ops = {
-	.pte_entry		= __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+	.pte_entry = s390_gather_pages,
 };
 
-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
 {
-	if (!mm_is_protected(mm))
-		return;
-	/*
-	 * we might be called during
-	 * reset:                             we walk the pages and clear
-	 * close of all kvm file descriptors: we walk the pages and clear
-	 * exit of process on fd closure:     vma already gone, do nothing
-	 */
-	if (!mmget_not_zero(mm))
-		return;
-	mmap_read_lock(mm);
-	walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
-	mmap_read_unlock(mm);
-	mmput(mm);
+	unsigned long i;
+
+	for (i = 0; i < count; i++) {
+		/* we always have an extra reference */
+		uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+		/* get rid of the extra reference */
+		put_page(pfn_to_page(pfns[i]));
+		cond_resched();
+	}
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/*
+ * Walk the given range of the given address space, and call the destroy
+ * secure page UVC on each page.
+ * Exit early if the number of users of the mm drops to (or below) the given
+ * value.
+ */
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+			   unsigned long start, unsigned long end)
+{
+	struct reset_walk_state state = { .next = start };
+	int r = 1;
+
+	while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
+		state.count = 0;
+		mmap_read_lock(mm);
+		r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+		mmap_read_unlock(mm);
+		cond_resched();
+		s390_uv_destroy_pfns(state.count, state.pfns);
+	}
 }
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(s390_uv_destroy_range);
-- 
2.31.1



* [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (4 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-27  9:29   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 07/11] KVM: s390: pv: add export before import Claudio Imbrenda
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Use the is_protected field as a counter instead of a flag. This will
be used in upcoming patches.

Increment the counter when a secure configuration is created, and
decrement it when it is destroyed. Previously the flag was set when the
set secure parameters UVC was performed.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/pv.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index c3f9f30d2ed4..59039b8a7be7 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -218,7 +218,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
 			   UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
 	WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
-	atomic_set(&kvm->mm->context.is_protected, 0);
+	if (!cc)
+		atomic_dec(&kvm->mm->context.is_protected);
 	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
 	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
 	/* Intended memory leak on "impossible" error */
@@ -259,11 +260,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	/* Outputs */
 	kvm->arch.pv.handle = uvcb.guest_handle;
 
+	atomic_inc(&kvm->mm->context.is_protected);
 	if (cc) {
-		if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
+		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
 			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
-		else
+		} else {
+			atomic_dec(&kvm->mm->context.is_protected);
 			kvm_s390_pv_dealloc_vm(kvm);
+		}
 		return -EIO;
 	}
 	kvm->arch.gmap->guest_handle = uvcb.guest_handle;
@@ -286,8 +290,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
 	*rrc = uvcb.header.rrc;
 	KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
 		     *rc, *rrc);
-	if (!cc)
-		atomic_set(&kvm->mm->context.is_protected, 1);
 	return cc ? -EINVAL : 0;
 }
 
-- 
2.31.1



* [PATCH v1 07/11] KVM: s390: pv: add export before import
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (5 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-26 11:56   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot Claudio Imbrenda
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Due to upcoming changes, it will be possible to temporarily have
multiple protected VMs in the same address space. When that happens,
it is necessary to perform an export of every page that is to be
imported.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kernel/uv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index b19b1a1444ec..dbcf4434eb53 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -242,6 +242,12 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
 	return rc;
 }
 
+static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
+{
+	return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+		atomic_read(&mm->context.is_protected) > 1;
+}
+
 /*
  * Requests the Ultravisor to make a page accessible to a guest.
  * If it's brought in the first time, it will be cleared. If
@@ -285,6 +291,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
 
 	lock_page(page);
 	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+	if (should_export_before_import(uvcb, gmap->mm))
+		uv_convert_from_secure(page_to_phys(page));
 	rc = make_secure_pte(ptep, uaddr, page, uvcb);
 	pte_unmap_unlock(ptep, ptelock);
 	unlock_page(page);
-- 
2.31.1



* [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (6 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 07/11] KVM: s390: pv: add export before import Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-27  9:43   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 09/11] KVM: s390: pv: extend lazy destroy to handle shutdown Claudio Imbrenda
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.

This patch implements a lazy destroy mechanism that allows a protected
guest to reboot significantly faster than previously.

This is achieved by clearing the pages of the old guest in the
background. In case of reboot, the new guest will be able to run in the
same address space almost immediately.

The old protected guest is then only destroyed when all of its memory
has been destroyed or otherwise made non-protected.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/kvm-s390.c |   6 +-
 arch/s390/kvm/kvm-s390.h |   2 +-
 arch/s390/kvm/pv.c       | 118 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f09e9d7dc95..db25aa1fb6a6 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2248,7 +2248,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
 
 		r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
 		if (r)
-			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+			kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);
 
 		/* we need to block service interrupts from now on */
 		set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2267,7 +2267,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
 		 */
 		if (r)
 			break;
-		r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
+		r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);
 
 		/* no need to block service interrupts any more */
 		clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2796,7 +2796,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	 * complaining we do not use kvm_s390_pv_is_protected.
 	 */
 	if (kvm_s390_pv_get_handle(kvm))
-		kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
+		kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
 	debug_unregister(kvm->arch.dbf);
 	free_page((unsigned long)kvm->arch.sie_page2);
 	if (!kvm_is_ucontrol(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 79dcd647b378..b3c0796a3cc1 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
 /* implemented in pv.c */
 int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
 int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
 int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
 int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
 			      u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 59039b8a7be7..9a3547966e18 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -14,8 +14,17 @@
 #include <asm/mman.h>
 #include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
+#include <linux/kthread.h>
 #include "kvm-s390.h"
 
+struct deferred_priv {
+	struct mm_struct *mm;
+	unsigned long old_table;
+	u64 handle;
+	void *virt;
+	unsigned long real;
+};
+
 int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
 {
 	int cc = 0;
@@ -202,7 +211,7 @@ static int kvm_s390_pv_replace_asce(struct kvm *kvm)
 }
 
 /* this should not fail, but if it does, we must not free the donated memory */
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
+static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
 {
 	int cc;
 
@@ -230,6 +239,111 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	return cc ? -EIO : 0;
 }
 
+static int kvm_s390_pv_destroy_vm_thread(void *priv)
+{
+	struct deferred_priv *p = priv;
+	u16 rc, rrc;
+	int r = 1;
+
+	/* Exit early if we end up being the only users of the mm */
+	s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+	mmput(p->mm);
+
+	r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+	WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+	if (r)
+		return r;
+	atomic_dec(&p->mm->context.is_protected);
+
+	/*
+	 * Intentional leak in case the destroy secure VM call fails. The
+	 * call should never fail if the hardware is not broken.
+	 */
+	free_pages(p->real, get_order(uv_info.guest_base_stor_len));
+	free_pages(p->old_table, CRST_ALLOC_ORDER);
+	vfree(p->virt);
+	kfree(p);
+	return 0;
+}
+
+static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
+{
+	struct task_struct *t;
+
+	priv->virt = kvm->arch.pv.stor_var;
+	priv->real = kvm->arch.pv.stor_base;
+	priv->handle = kvm_s390_pv_get_handle(kvm);
+	priv->old_table = (unsigned long)kvm->arch.gmap->table;
+	WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+
+	if (kvm_s390_pv_replace_asce(kvm))
+		goto fail;
+
+	t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
+			   "kvm_s390_pv_destroy_vm_thread");
+	if (IS_ERR_OR_NULL(t))
+		goto fail;
+
+	memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
+	wake_up_process(t);
+	/*
+	 * no actual UVC is performed at this point, just return a successful
+	 * rc value to make userspace happy, and an arbitrary rrc
+	 */
+	*rc = 1;
+	*rrc = 42;
+
+	return 0;
+
+fail:
+	kfree(priv);
+	return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+}
+
+/*  Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	unsigned long lim;
+	int idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	slots = kvm_memslots(kvm);
+	kvm_for_each_memslot(slot, slots) {
+		if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
+			continue;
+		if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
+			lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+		else
+			lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+		s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+	struct deferred_priv *priv;
+
+	priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+	if (!priv)
+		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+	if (mmget_not_zero(kvm->mm)) {
+		kvm_s390_clear_2g(kvm);
+	} else {
+		/* No deferred work to do */
+		kfree(priv);
+		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+	}
+	priv->mm = kvm->mm;
+	return deferred_destroy(kvm, priv, rc, rrc);
+}
+
 int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 {
 	struct uv_cb_cgc uvcb = {
@@ -263,7 +377,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	atomic_inc(&kvm->mm->context.is_protected);
 	if (cc) {
 		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
-			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+			kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
 		} else {
 			atomic_dec(&kvm->mm->context.is_protected);
 			kvm_s390_pv_dealloc_vm(kvm);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 09/11] KVM: s390: pv: extend lazy destroy to handle shutdown
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (7 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-17 20:07 ` [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy Claudio Imbrenda
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Extend the lazy destroy infrastructure to handle almost every kind of
userspace process exit. The only case not handled is when the process
is killed by the OOM killer; in that case the old behaviour still
applies.

Add the uv_destroy_page_lazy function to set aside pages when unmapping
them during mm teardown; the pages will be processed and freed when the
protected VM is torn down.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/include/asm/mmu.h         |  3 ++
 arch/s390/include/asm/mmu_context.h |  2 ++
 arch/s390/include/asm/pgtable.h     |  9 ++++--
 arch/s390/include/asm/uv.h          | 15 +++++++++
 arch/s390/kernel/uv.c               | 47 +++++++++++++++++++++++++++++
 arch/s390/kvm/pv.c                  | 34 +++++++++++++++++----
 6 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..e2250dbe3d9d 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -18,6 +18,7 @@ typedef struct {
 	unsigned long vdso_base;
 	/* The mmu context belongs to a secure guest. */
 	atomic_t is_protected;
+	struct list_head deferred_list;
 	/*
 	 * The following bitfields need a down_write on the mm
 	 * semaphore when they are written to. As they are only
@@ -34,6 +35,8 @@ typedef struct {
 	unsigned int uses_cmm:1;
 	/* The gmaps associated with this context are allowed to use huge pages. */
 	unsigned int allow_gmap_hpage_1m:1;
+	/* The mmu context should be destroyed synchronously */
+	unsigned int pv_sync_destroy:1;
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(name)						   \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index e7cffc7b5c2f..da62140404e8 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -27,8 +27,10 @@ static inline int init_new_context(struct task_struct *tsk,
 	cpumask_clear(&mm->context.cpu_attach_mask);
 	atomic_set(&mm->context.flush_count, 0);
 	atomic_set(&mm->context.is_protected, 0);
+	mm->context.pv_sync_destroy = 0;
 	mm->context.gmap_asce = 0;
 	mm->context.flush_mm = 0;
+	INIT_LIST_HEAD(&mm->context.deferred_list);
 #ifdef CONFIG_PGSTE
 	mm->context.alloc_pgste = page_table_allocate_pgste ||
 		test_thread_flag(TIF_PGSTE) ||
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 141a8aaacd6d..05c68d3c2256 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1106,9 +1106,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 	} else {
 		res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
 	}
+
 	/* At this point the reference through the mapping is still present */
-	if (mm_is_protected(mm) && pte_present(res))
-		uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+	if (pte_present(res) && mm_is_protected(mm)) {
+		if (full && !mm->context.pv_sync_destroy)
+			uv_destroy_page_lazy(mm, pte_val(res) & PAGE_MASK);
+		else
+			uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+	}
 	return res;
 }
 
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index c6fe6a42e79b..1de5ff5e192a 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -339,6 +339,15 @@ static inline int uv_remove_shared(unsigned long addr) { return 0; }
 #if IS_ENABLED(CONFIG_KVM)
 extern int prot_virt_host;
 
+struct destroy_page_lazy {
+	struct list_head list;
+	unsigned short count;
+	unsigned long pfns[];
+};
+
+/* This guarantees that up to PV_MAX_LAZY_COUNT can fit in a page */
+#define PV_MAX_LAZY_COUNT ((PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))
+
 static inline int is_prot_virt_host(void)
 {
 	return prot_virt_host;
@@ -347,6 +356,7 @@ static inline int is_prot_virt_host(void)
 int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
 int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
 int uv_destroy_owned_page(unsigned long paddr);
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr);
 int uv_convert_from_secure(unsigned long paddr);
 int uv_convert_owned_from_secure(unsigned long paddr);
 int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
@@ -363,6 +373,11 @@ static inline int uv_destroy_owned_page(unsigned long paddr)
 	return 0;
 }
 
+static inline int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+	return 0;
+}
+
 static inline int uv_convert_from_secure(unsigned long paddr)
 {
 	return 0;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index dbcf4434eb53..434d81baceed 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -192,6 +192,53 @@ int uv_convert_owned_from_secure(unsigned long paddr)
 	return rc;
 }
 
+/*
+ * Set aside the given page and put it in the list of pages to be cleared in
+ * background. The caller must already hold a reference to the page.
+ */
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+	struct list_head *head = &mm->context.deferred_list;
+	struct destroy_page_lazy *lazy;
+	struct page *page;
+	int rc;
+
+	/* get an extra reference here */
+	get_page(phys_to_page(paddr));
+
+	lazy = list_first_entry(head, struct destroy_page_lazy, list);
+	/*
+	 * We need a fresh page to store more pointers. The current page
+	 * might be shared, so it cannot be used directly. Instead, make it
+	 * accessible and release it, and let the normal unmap code free it
+	 * later, if needed.
+	 * Afterwards, try to allocate a new page, but not very hard. If the
+	 * allocation fails, we simply return. The next call to this
+	 * function will attempt to do the same again, until enough pages
+	 * have been freed.
+	 */
+	if (list_empty(head) || lazy->count >= PV_MAX_LAZY_COUNT) {
+		rc = uv_convert_owned_from_secure(paddr);
+		/* in case of failure, we intentionally leak the page */
+		if (rc)
+			return rc;
+		/* release the extra reference */
+		put_page(phys_to_page(paddr));
+
+		/* try to allocate a new page quickly, but allow failures */
+		page = alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
+		if (!page)
+			return -ENOMEM;
+		lazy = page_to_virt(page);
+		lazy->count = 0;
+		list_add(&lazy->list, head);
+		return 0;
+	}
+	/* the array of pointers has space, just add this entry */
+	lazy->pfns[lazy->count++] = phys_to_pfn(paddr);
+	return 0;
+}
+
 /*
  * Calculate the expected ref_count for a page that would otherwise have no
  * further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 9a3547966e18..4333d3e54ef0 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -19,6 +19,7 @@
 
 struct deferred_priv {
 	struct mm_struct *mm;
+	bool has_mm;
 	unsigned long old_table;
 	u64 handle;
 	void *virt;
@@ -241,13 +242,29 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
 
 static int kvm_s390_pv_destroy_vm_thread(void *priv)
 {
+	struct destroy_page_lazy *lazy, *next;
 	struct deferred_priv *p = priv;
 	u16 rc, rrc;
-	int r = 1;
+	int r;
 
-	/* Exit early if we end up being the only users of the mm */
-	s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
-	mmput(p->mm);
+	list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
+		list_del(&lazy->list);
+		s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+		free_page(__pa(lazy));
+	}
+
+	if (p->has_mm) {
+		/* Exit early if we end up being the only users of the mm */
+		s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+		if (atomic_read(&p->mm->mm_users) == 1) {
+			mmap_write_lock(p->mm);
+			/* destroy synchronously if there are no other users */
+			p->mm->context.pv_sync_destroy = 1;
+			mmap_write_unlock(p->mm);
+		}
+		mmput(p->mm);
+	}
+	mmdrop(p->mm);
 
 	r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
 	WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
@@ -276,7 +293,9 @@ static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc
 	priv->old_table = (unsigned long)kvm->arch.gmap->table;
 	WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
 
-	if (kvm_s390_pv_replace_asce(kvm))
+	if (!priv->has_mm)
+		kvm_s390_pv_remove_old_asce(kvm);
+	else if (kvm_s390_pv_replace_asce(kvm))
 		goto fail;
 
 	t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
@@ -333,10 +352,13 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
 	if (!priv)
 		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
 
+	mmgrab(kvm->mm);
 	if (mmget_not_zero(kvm->mm)) {
+		priv->has_mm = true;
 		kvm_s390_clear_2g(kvm);
-	} else {
+	} else if (list_empty(&kvm->mm->context.deferred_list)) {
 		/* No deferred work to do */
+		mmdrop(kvm->mm);
 		kfree(priv);
 		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (8 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 09/11] KVM: s390: pv: extend lazy destroy to handle shutdown Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-27 10:35   ` Janosch Frank
  2021-05-17 20:07 ` [PATCH v1 11/11] KVM: s390: pv: add support for UV feature bits Claudio Imbrenda
  2021-05-18 15:05 ` [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Cornelia Huck
  11 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Add the module parameter "lazy_destroy" to allow the lazy destroy
mechanism to be switched off. This can be useful for debugging
purposes.

The parameter is enabled by default.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/kvm/pv.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 4333d3e54ef0..00c14406205f 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -26,6 +26,10 @@ struct deferred_priv {
 	unsigned long real;
 };
 
+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+
 int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
 {
 	int cc = 0;
@@ -348,6 +352,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
 {
 	struct deferred_priv *priv;
 
+	if (!lazy_destroy)
+		kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
 	priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
 	if (!priv)
 		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
@@ -396,6 +403,12 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
 	/* Outputs */
 	kvm->arch.pv.handle = uvcb.guest_handle;
 
+	if (!lazy_destroy) {
+		mmap_write_lock(kvm->mm);
+		kvm->mm->context.pv_sync_destroy = 1;
+		mmap_write_unlock(kvm->mm);
+	}
+
 	atomic_inc(&kvm->mm->context.is_protected);
 	if (cc) {
 		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v1 11/11] KVM: s390: pv: add support for UV feature bits
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (9 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy Claudio Imbrenda
@ 2021-05-17 20:07 ` Claudio Imbrenda
  2021-05-18 15:05 ` [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Cornelia Huck
  11 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-17 20:07 UTC (permalink / raw)
  To: kvm
  Cc: cohuck, borntraeger, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

Add support for Ultravisor feature bits, and take advantage of the
advertised functionality to speed up the lazy destroy mechanism.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 arch/s390/boot/uv.c        | 1 +
 arch/s390/include/asm/uv.h | 9 ++++++++-
 arch/s390/kernel/uv.c      | 3 ++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/s390/boot/uv.c b/arch/s390/boot/uv.c
index 87641dd65ccf..efdb55364919 100644
--- a/arch/s390/boot/uv.c
+++ b/arch/s390/boot/uv.c
@@ -36,6 +36,7 @@ void uv_query_info(void)
 		uv_info.max_sec_stor_addr = ALIGN(uvcb.max_guest_stor_addr, PAGE_SIZE);
 		uv_info.max_num_sec_conf = uvcb.max_num_sec_conf;
 		uv_info.max_guest_cpu_id = uvcb.max_guest_cpu_id;
+		uv_info.feature_bits = uvcb.uv_feature_bits;
 	}
 
 #ifdef CONFIG_PROTECTED_VIRTUALIZATION_GUEST
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 1de5ff5e192a..808c2f0e0d7c 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -73,6 +73,11 @@ enum uv_cmds_inst {
 	BIT_UVC_CMD_UNPIN_PAGE_SHARED = 22,
 };
 
+/* Bits in uv features field */
+enum uv_features {
+	BIT_UVC_FEAT_MISC_0 = 0,
+};
+
 struct uv_cb_header {
 	u16 len;
 	u16 cmd;	/* Command Code */
@@ -97,7 +102,8 @@ struct uv_cb_qui {
 	u64 max_guest_stor_addr;
 	u8  reserved88[158 - 136];
 	u16 max_guest_cpu_id;
-	u8  reserveda0[200 - 160];
+	u64 uv_feature_bits;
+	u8  reserveda0[200 - 168];
 } __packed __aligned(8);
 
 /* Initialize Ultravisor */
@@ -274,6 +280,7 @@ struct uv_info {
 	unsigned long max_sec_stor_addr;
 	unsigned int max_num_sec_conf;
 	unsigned short max_guest_cpu_id;
+	unsigned long feature_bits;
 };
 
 extern struct uv_info uv_info;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 434d81baceed..b1fa38d5c388 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -291,7 +291,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
 
 static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
 {
-	return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+	return !test_bit_inv(BIT_UVC_FEAT_MISC_0, &uv_info.feature_bits) &&
+		uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
 		atomic_read(&mm->context.is_protected) > 1;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails
  2021-05-17 20:07 ` [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails Claudio Imbrenda
@ 2021-05-18 10:26   ` Janosch Frank
  2021-05-18 10:40     ` Claudio Imbrenda
  0 siblings, 1 reply; 34+ messages in thread
From: Janosch Frank @ 2021-05-18 10:26 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> When the destroy configuration UVC fails, the page pointed to by the
> ASCE of the VM becomes poisoned, and, to avoid issues it must not be
> used again.
> 
> Since the page becomes in practice unusable, we set it aside and leak it.

I think we need something a bit more specific.

On creation of a protected guest the top most level of page tables are
marked by the Ultravisor and can only be used as top level page tables
for the protected guest that was created. If another protected guest
would re-use those pages for its top level page tables the UV would
throw errors.

When a destroy fails the UV will not remove the markings so these pages
are basically unusable since we can't guarantee that they won't be used
for a guest ASCE in the future.

Hence we choose to leak those pages in the very unlikely event that a
destroy fails.


LGTM

> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 52 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 813b6e93dc83..e0532ab725bf 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -150,6 +150,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
>  	return -ENOMEM;
>  }
>  
> +/*
> + * Remove the topmost level of page tables from the list of page tables of
> + * the gmap.
> + * This means that it will not be freed when the VM is torn down, and needs
> + * to be handled separately by the caller, unless an intentional leak is
> + * intended.
> + */
> +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
> +{
> +	struct page *old;
> +
> +	old = virt_to_page(kvm->arch.gmap->table);
> +	list_del(&old->lru);
> +	/* in case the ASCE needs to be "removed" multiple times */
> +	INIT_LIST_HEAD(&old->lru);

?

> +}
> +
> +/*
> + * Try to replace the current ASCE with another equivalent one.
> + * If the allocation of the new top level page table fails, the ASCE is not
> + * replaced.
> + * In any case, the old ASCE is removed from the list, therefore the caller
> + * has to make sure to save a pointer to it beforehands, unless an
> + * intentional leak is intended.
> + */
> +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
> +{
> +	unsigned long asce;
> +	struct page *page;
> +	void *table;
> +
> +	kvm_s390_pv_remove_old_asce(kvm);
> +
> +	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> +	if (!page)
> +		return -ENOMEM;
> +	list_add(&page->lru, &kvm->arch.gmap->crst_list);
> +
> +	table = page_to_virt(page);
> +	memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
> +
> +	asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
> +	WRITE_ONCE(kvm->arch.gmap->asce, asce);
> +	WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
> +	WRITE_ONCE(kvm->arch.gmap->table, table);
> +
> +	return 0;
> +}
> +
>  /* this should not fail, but if it does, we must not free the donated memory */
>  int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  {
> @@ -164,9 +213,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	atomic_set(&kvm->mm->context.is_protected, 0);
>  	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
>  	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> -	/* Inteded memory leak on "impossible" error */
> +	/* Intended memory leak on "impossible" error */
>  	if (!cc)
>  		kvm_s390_pv_dealloc_vm(kvm);
> +	else
> +		kvm_s390_pv_replace_asce(kvm);
>  	return cc ? -EIO : 0;
>  }
>  
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails
  2021-05-18 10:26   ` Janosch Frank
@ 2021-05-18 10:40     ` Claudio Imbrenda
  2021-05-18 12:00       ` Janosch Frank
  0 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 10:40 UTC (permalink / raw)
  To: Janosch Frank
  Cc: kvm, cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On Tue, 18 May 2021 12:26:51 +0200
Janosch Frank <frankja@linux.ibm.com> wrote:

> On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> > When the destroy configuration UVC fails, the page pointed to by the
> > ASCE of the VM becomes poisoned, and, to avoid issues it must not be
> > used again.
> > 
> > Since the page becomes in practice unusable, we set it aside and
> > leak it.  
> 
> I think we need something a bit more specific.
> 
> On creation of a protected guest the top most level of page tables are
> marked by the Ultravisor and can only be used as top level page tables
> for the protected guest that was created. If another protected guest
> would re-use those pages for its top level page tables the UV would
> throw errors.
> 
> When a destroy fails the UV will not remove the markings so these
> pages are basically unusable since we can't guarantee that they won't
> be used for a guest ASCE in the future.
> 
> Hence we choose to leak those pages in the very unlikely event that a
> destroy fails.

it's more than that. the top level page, once marked, also cannot be
used as backing for the virtual and real memory areas donated with the
create secure configuration and create secure cpu UVCs.

and there might also be other circumstances in which that page cannot
be used that I am not aware of

> 
> LGTM
> 
> > 
> > Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> > ---
> >  arch/s390/kvm/pv.c | 53
> > +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 52
> > insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index 813b6e93dc83..e0532ab725bf 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -150,6 +150,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm
> > *kvm) return -ENOMEM;
> >  }
> >  
> > +/*
> > + * Remove the topmost level of page tables from the list of page
> > tables of
> > + * the gmap.
> > + * This means that it will not be freed when the VM is torn down,
> > and needs
> > + * to be handled separately by the caller, unless an intentional
> > leak is
> > + * intended.
> > + */
> > +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
> > +{
> > +	struct page *old;
> > +
> > +	old = virt_to_page(kvm->arch.gmap->table);
> > +	list_del(&old->lru);
> > +	/* in case the ASCE needs to be "removed" multiple times */
> > +	INIT_LIST_HEAD(&old->lru);  
> 
> ?
> 
> > +}
> > +
> > +/*
> > + * Try to replace the current ASCE with another equivalent one.
> > + * If the allocation of the new top level page table fails, the
> > ASCE is not
> > + * replaced.
> > + * In any case, the old ASCE is removed from the list, therefore
> > the caller
> > + * has to make sure to save a pointer to it beforehands, unless an
> > + * intentional leak is intended.
> > + */
> > +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
> > +{
> > +	unsigned long asce;
> > +	struct page *page;
> > +	void *table;
> > +
> > +	kvm_s390_pv_remove_old_asce(kvm);
> > +
> > +	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> > +	if (!page)
> > +		return -ENOMEM;
> > +	list_add(&page->lru, &kvm->arch.gmap->crst_list);
> > +
> > +	table = page_to_virt(page);
> > +	memcpy(table, kvm->arch.gmap->table, 1UL <<
> > (CRST_ALLOC_ORDER + PAGE_SHIFT)); +
> > +	asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
> > +	WRITE_ONCE(kvm->arch.gmap->asce, asce);
> > +	WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
> > +	WRITE_ONCE(kvm->arch.gmap->table, table);
> > +
> > +	return 0;
> > +}
> > +
> >  /* this should not fail, but if it does, we must not free the
> > donated memory */ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16
> > *rc, u16 *rrc) {
> > @@ -164,9 +213,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16
> > *rc, u16 *rrc) atomic_set(&kvm->mm->context.is_protected, 0);
> >  	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x",
> > *rc, *rrc); WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc
> > %x", *rc, *rrc);
> > -	/* Inteded memory leak on "impossible" error */
> > +	/* Intended memory leak on "impossible" error */
> >  	if (!cc)
> >  		kvm_s390_pv_dealloc_vm(kvm);
> > +	else
> > +		kvm_s390_pv_replace_asce(kvm);
> >  	return cc ? -EIO : 0;
> >  }
> >  
> >   
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails
  2021-05-18 10:40     ` Claudio Imbrenda
@ 2021-05-18 12:00       ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-18 12:00 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/18/21 12:40 PM, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 12:26:51 +0200
> Janosch Frank <frankja@linux.ibm.com> wrote:
> 
>> On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
>>> When the destroy configuration UVC fails, the page pointed to by the
>>> ASCE of the VM becomes poisoned, and, to avoid issues it must not be
>>> used again.
>>>
>>> Since the page becomes in practice unusable, we set it aside and
>>> leak it.  
>>
>> I think we need something a bit more specific.
>>
>> On creation of a protected guest the top most level of page tables are
>> marked by the Ultravisor and can only be used as top level page tables
>> for the protected guest that was created. If another protected guest
>> would re-use those pages for its top level page tables the UV would
>> throw errors.
>>
>> When a destroy fails the UV will not remove the markings so these
>> pages are basically unusable since we can't guarantee that they won't
>> be used for a guest ASCE in the future.
>>
>> Hence we choose to leak those pages in the very unlikely event that a
>> destroy fails.
> 
> it's more than that. the top level page, once marked, also cannot be
> used as backing for the virtual and real memory areas donated with the
> create secure configuration and create secure cpu UVCs.
> 
> and there might also other circumstances in which that page cannot be
> used that I am not aware of
> 

Even more reason to document it :)

>>
>> LGTM
>>
>>>
>>> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
>>> ---
>>>  arch/s390/kvm/pv.c | 53
>>> +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 52
>>> insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
>>> index 813b6e93dc83..e0532ab725bf 100644
>>> --- a/arch/s390/kvm/pv.c
>>> +++ b/arch/s390/kvm/pv.c
>>> @@ -150,6 +150,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
>>>  	return -ENOMEM;
>>>  }
>>>  
>>> +/*
>>> + * Remove the topmost level of page tables from the list of page
>>> + * tables of the gmap.
>>> + * This means that it will not be freed when the VM is torn down,
>>> + * and needs to be handled separately by the caller, unless the
>>> + * leak is intentional.
>>> + */
>>> +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
>>> +{
>>> +	struct page *old;
>>> +
>>> +	old = virt_to_page(kvm->arch.gmap->table);
>>> +	list_del(&old->lru);
>>> +	/* in case the ASCE needs to be "removed" multiple times */
>>> +	INIT_LIST_HEAD(&old->lru);  
>>
>> ?
>>
>>> +}
>>> +
>>> +/*
>>> + * Try to replace the current ASCE with another equivalent one.
>>> + * If the allocation of the new top-level page table fails, the
>>> + * ASCE is not replaced.
>>> + * In any case, the old ASCE is removed from the list, therefore
>>> + * the caller has to save a pointer to it beforehand, unless the
>>> + * leak is intentional.
>>> + */
>>> +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
>>> +{
>>> +	unsigned long asce;
>>> +	struct page *page;
>>> +	void *table;
>>> +
>>> +	kvm_s390_pv_remove_old_asce(kvm);
>>> +
>>> +	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
>>> +	if (!page)
>>> +		return -ENOMEM;
>>> +	list_add(&page->lru, &kvm->arch.gmap->crst_list);
>>> +
>>> +	table = page_to_virt(page);
>>> +	memcpy(table, kvm->arch.gmap->table,
>>> +	       1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
>>> +
>>> +	asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
>>> +	WRITE_ONCE(kvm->arch.gmap->asce, asce);
>>> +	WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
>>> +	WRITE_ONCE(kvm->arch.gmap->table, table);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>  /* this should not fail, but if it does, we must not free the donated memory */
>>>  int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>>>  {
>>> @@ -164,9 +213,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>>>  	atomic_set(&kvm->mm->context.is_protected, 0);
>>>  	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
>>>  	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
>>> -	/* Inteded memory leak on "impossible" error */
>>> +	/* Intended memory leak on "impossible" error */
>>>  	if (!cc)
>>>  		kvm_s390_pv_dealloc_vm(kvm);
>>> +	else
>>> +		kvm_s390_pv_replace_asce(kvm);
>>>  	return cc ? -EIO : 0;
>>>  }
>>>  
>>>   
>>
> 


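[Editorial aside: the replacement trick discussed in this patch — copy the
top-level table into a fresh page and deliberately leak the UV-marked
one — can be sketched in plain userspace C. All names here are invented
toys, not the kernel API:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Exact size is irrelevant for the sketch; use 4 pages, matching a
 * CRST allocation order of 2 on s390. */
#define TABLE_BYTES (4 * 4096)

struct toy_gmap {
	uint64_t *table;	/* top-level (region/segment) table */
};

/*
 * Copy the contents of the current top-level table into a freshly
 * allocated one and switch the pointer over. The old page is NOT
 * freed: it stays marked by the Ultravisor, so returning it to the
 * allocator could poison a future guest's ASCE.
 * Returns 0 on success, -1 if no memory is available (old table kept).
 */
static int toy_replace_asce(struct toy_gmap *gmap)
{
	uint64_t *fresh = malloc(TABLE_BYTES);

	if (!fresh)
		return -1;
	memcpy(fresh, gmap->table, TABLE_BYTES);
	gmap->table = fresh;	/* old page intentionally leaked */
	return 0;
}
```

[The kernel version additionally keeps the new page on the gmap's
crst_list and rewrites the ASCE origin, as the patch above does.]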
^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
                   ` (10 preceding siblings ...)
  2021-05-17 20:07 ` [PATCH v1 11/11] KVM: s390: pv: add support for UV feature bits Claudio Imbrenda
@ 2021-05-18 15:05 ` Cornelia Huck
  2021-05-18 15:36   ` Claudio Imbrenda
  11 siblings, 1 reply; 34+ messages in thread
From: Cornelia Huck @ 2021-05-18 15:05 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, borntraeger, frankja, thuth, pasic, david, linux-s390, linux-kernel

On Mon, 17 May 2021 22:07:47 +0200
Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs).  This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
> 
> This patchseries implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> background, allowing the guest to restart or terminate significantly
> faster than before.
> 
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
> 
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.
> 
> For the shutdown case, a list of pages to be destroyed is formed when
> the mm is torn down. Instead of just unmapping the pages when the
> address space is being torn down, they are also set aside. Later when
> KVM cleans up the VM, a thread is started to clean up the pages from
> the list.

Just to make sure, 'clean up' includes doing uv calls?

> 
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running, the
> others will in fact not even have any CPUs.

Are those set-aside-but-not-yet-cleaned-up pages still possibly
accessible in any way? I would assume that they only belong to the
'zombie' guests, and any new or rebooted guest is a new entity that
needs to get new pages?

Can too many not-yet-cleaned-up pages lead to a (temporary) memory
exhaustion?



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 15:05 ` [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Cornelia Huck
@ 2021-05-18 15:36   ` Claudio Imbrenda
  2021-05-18 15:45     ` Christian Borntraeger
  2021-05-18 16:04     ` Cornelia Huck
  0 siblings, 2 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 15:36 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, borntraeger, frankja, thuth, pasic, david, linux-s390, linux-kernel

On Tue, 18 May 2021 17:05:37 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Mon, 17 May 2021 22:07:47 +0200
> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> > Previously, when a protected VM was rebooted or when it was shut
> > down, its memory was made unprotected, and then the protected VM
> > itself was destroyed. Looping over the whole address space can take
> > some time, considering the overhead of the various Ultravisor Calls
> > (UVCs).  This means that a reboot or a shutdown would take a
> > potentially long amount of time, depending on the amount of used
> > memory.
> > 
> > This patchseries implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in background, allowing the guest to restart or
> > terminate significantly faster than before.
> > 
> > There are 2 possibilities when a protected VM is torn down:
> > * it still has an address space associated (reboot case)
> > * it does not have an address space anymore (shutdown case)
> > 
> > For the reboot case, the reference count of the mm is increased, and
> > then a background thread is started to clean up. Once the thread
> > went through the whole address space, the protected VM is actually
> > destroyed.
> > 
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just unmapping the pages when
> > the address space is being torn down, they are also set aside.
> > Later when KVM cleans up the VM, a thread is started to clean up
> > the pages from the list.  
> 
> Just to make sure, 'clean up' includes doing uv calls?

yes

> > 
> > This means that the same address space can have memory belonging to
> > more than one protected guest, although only one will be running,
> > the others will in fact not even have any CPUs.  
> 
> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> accessible in any way? I would assume that they only belong to the

in case of reboot: yes, they are still in the address space of the
guest, and can be swapped if needed

> 'zombie' guests, and any new or rebooted guest is a new entity that
> needs to get new pages?

the rebooted guest (normal or secure) will re-use the same pages of the
old guest (before or after cleanup, which is the reason for patches 3
and 4)

the KVM guest is not affected in case of reboot, so the userspace
address space is not touched.

> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> exhaustion?

in case of reboot, not much; the pages that were in use are still in
use after the reboot, and they can be swapped.

in case of a shutdown, yes, because the pages are really taken aside
and cleared/destroyed in background. they cannot be swapped. they are
freed immediately as they are processed, to try to mitigate memory
exhaustion scenarios.

in the end, this patchseries is a tradeoff between speed and memory
consumption. the memory needs to be cleared up at some point, and that
requires time.

in cases where this might be an issue, I introduced a new KVM flag to
disable lazy destroy (patch 10)




* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 15:36   ` Claudio Imbrenda
@ 2021-05-18 15:45     ` Christian Borntraeger
  2021-05-18 15:52       ` Cornelia Huck
  2021-05-18 16:13       ` Claudio Imbrenda
  2021-05-18 16:04     ` Cornelia Huck
  1 sibling, 2 replies; 34+ messages in thread
From: Christian Borntraeger @ 2021-05-18 15:45 UTC (permalink / raw)
  To: Claudio Imbrenda, Cornelia Huck
  Cc: kvm, frankja, thuth, pasic, david, linux-s390, linux-kernel



On 18.05.21 17:36, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Mon, 17 May 2021 22:07:47 +0200
>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>> Previously, when a protected VM was rebooted or when it was shut
>>> down, its memory was made unprotected, and then the protected VM
>>> itself was destroyed. Looping over the whole address space can take
>>> some time, considering the overhead of the various Ultravisor Calls
>>> (UVCs).  This means that a reboot or a shutdown would take a
>>> potentially long amount of time, depending on the amount of used
>>> memory.
>>>
>>> This patchseries implements a deferred destroy mechanism for
>>> protected guests. When a protected guest is destroyed, its memory
>>> is cleared in background, allowing the guest to restart or
>>> terminate significantly faster than before.
>>>
>>> There are 2 possibilities when a protected VM is torn down:
>>> * it still has an address space associated (reboot case)
>>> * it does not have an address space anymore (shutdown case)
>>>
>>> For the reboot case, the reference count of the mm is increased, and
>>> then a background thread is started to clean up. Once the thread
>>> went through the whole address space, the protected VM is actually
>>> destroyed.
>>>
>>> For the shutdown case, a list of pages to be destroyed is formed
>>> when the mm is torn down. Instead of just unmapping the pages when
>>> the address space is being torn down, they are also set aside.
>>> Later when KVM cleans up the VM, a thread is started to clean up
>>> the pages from the list.
>>
>> Just to make sure, 'clean up' includes doing uv calls?
> 
> yes
> 
>>>
>>> This means that the same address space can have memory belonging to
>>> more than one protected guest, although only one will be running,
>>> the others will in fact not even have any CPUs.
>>
>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>> accessible in any way? I would assume that they only belong to the
> 
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
> 
>> 'zombie' guests, and any new or rebooted guest is a new entity that
>> needs to get new pages?
> 
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)
> 
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.
> 
>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>> exhaustion?
> 
> in case of reboot, not much; the pages were in use are still in use
> after the reboot, and they can be swapped.
> 
> in case of a shutdown, yes, because the pages are really taken aside
> and cleared/destroyed in background. they cannot be swapped. they are
> freed immediately as they are processed, to try to mitigate memory
> exhaustion scenarios.
> 
> in the end, this patchseries is a tradeoff between speed and memory
> consumption. the memory needs to be cleared up at some point, and that
> requires time.
> 
> in cases where this might be an issue, I introduced a new KVM flag to
> disable lazy destroy (patch 10)

Maybe we could piggy-back on the OOM-kill notifier and then fall back to
synchronous freeing for some pages?


* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 15:45     ` Christian Borntraeger
@ 2021-05-18 15:52       ` Cornelia Huck
  2021-05-18 16:13       ` Claudio Imbrenda
  1 sibling, 0 replies; 34+ messages in thread
From: Cornelia Huck @ 2021-05-18 15:52 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Claudio Imbrenda, kvm, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:

> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?  
> > 
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> > 
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they are
> > freed immediately as they are processed, to try to mitigate memory
> > exhaustion scenarios.
> > 
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and that
> > requires time.
> > 
> > in cases where this might be an issue, I introduced a new KVM flag to
> > disable lazy destroy (patch 10)  
> 
> Maybe we could piggy-back on the OOM-kill notifier and then fall back to
> synchronous freeing for some pages?

Sounds like a good idea. If delayed cleanup is safe, you probably want
to have the fast shutdown behaviour.



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 15:36   ` Claudio Imbrenda
  2021-05-18 15:45     ` Christian Borntraeger
@ 2021-05-18 16:04     ` Cornelia Huck
  2021-05-18 16:19       ` Claudio Imbrenda
  1 sibling, 1 reply; 34+ messages in thread
From: Cornelia Huck @ 2021-05-18 16:04 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: kvm, borntraeger, frankja, thuth, pasic, david, linux-s390, linux-kernel

On Tue, 18 May 2021 17:36:24 +0200
Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Mon, 17 May 2021 22:07:47 +0200
> > Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:

> > > This means that the same address space can have memory belonging to
> > > more than one protected guest, although only one will be running,
> > > the others will in fact not even have any CPUs.    
> > 
> > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > accessible in any way? I would assume that they only belong to the  
> 
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
> 
> > 'zombie' guests, and any new or rebooted guest is a new entity that
> > needs to get new pages?  
> 
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)

Took a look at those patches, makes sense.

> 
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.

'guest' is a bit ambiguous here -- do you mean the vm here, and the
actual guest above?



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 15:45     ` Christian Borntraeger
  2021-05-18 15:52       ` Cornelia Huck
@ 2021-05-18 16:13       ` Claudio Imbrenda
  2021-05-18 16:20         ` Christian Borntraeger
  1 sibling, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 16:13 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Cornelia Huck, kvm, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Mon, 17 May 2021 22:07:47 +0200
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>  
> >>> Previously, when a protected VM was rebooted or when it was shut
> >>> down, its memory was made unprotected, and then the protected VM
> >>> itself was destroyed. Looping over the whole address space can
> >>> take some time, considering the overhead of the various
> >>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
> >>> would take a potentially long amount of time, depending on the
> >>> amount of used memory.
> >>>
> >>> This patchseries implements a deferred destroy mechanism for
> >>> protected guests. When a protected guest is destroyed, its memory
> >>> is cleared in background, allowing the guest to restart or
> >>> terminate significantly faster than before.
> >>>
> >>> There are 2 possibilities when a protected VM is torn down:
> >>> * it still has an address space associated (reboot case)
> >>> * it does not have an address space anymore (shutdown case)
> >>>
> >>> For the reboot case, the reference count of the mm is increased,
> >>> and then a background thread is started to clean up. Once the
> >>> thread went through the whole address space, the protected VM is
> >>> actually destroyed.
> >>>
> >>> For the shutdown case, a list of pages to be destroyed is formed
> >>> when the mm is torn down. Instead of just unmapping the pages when
> >>> the address space is being torn down, they are also set aside.
> >>> Later when KVM cleans up the VM, a thread is started to clean up
> >>> the pages from the list.  
> >>
> >> Just to make sure, 'clean up' includes doing uv calls?  
> > 
> > yes
> >   
> >>>
> >>> This means that the same address space can have memory belonging
> >>> to more than one protected guest, although only one will be
> >>> running, the others will in fact not even have any CPUs.  
> >>
> >> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >> accessible in any way? I would assume that they only belong to the
> >>  
> > 
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >   
> >> 'zombie' guests, and any new or rebooted guest is a new entity that
> >> needs to get new pages?  
> > 
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)
> > 
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.
> >   
> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?  
> > 
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> > 
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they
> > are freed immediately as they are processed, to try to mitigate
> > memory exhaustion scenarios.
> > 
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and
> > that requires time.
> > 
> > in cases where this might be an issue, I introduced a new KVM flag
> > to disable lazy destroy (patch 10)  
> 
> Maybe we could piggy-back on the OOM-kill notifier and then fall back
> to synchronous freeing for some pages?

I'm not sure I follow

once the pages have been set aside, it's too late

while the pages are being set aside, every now and then some memory
needs to be allocated. the allocation is atomic, not allowed to use
emergency reserves, and can fail without warning. if the allocation
fails, we clean up one page and continue, without setting aside
anything (patch 9)

so if the system is low on memory, the lazy destroy should not make the
situation too much worse.

the only issue here is starting a normal process in the host (maybe
a non-secure guest) that uses a lot of memory very quickly, right after
a large secure guest has terminated.
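[Editorial aside: the fallback described here (patch 9's behaviour) can
be sketched roughly like this, with toy names standing in for the real
list nodes and UVC machinery:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_node { void *page; };

static int deferred;		/* pages queued for background cleanup */
static int destroyed;		/* pages destroyed synchronously instead */
static bool alloc_fails;	/* simulate atomic allocation failure */

/* Stands in for an atomic, no-reserves allocation that may fail. */
static void *toy_alloc_node(void)
{
	return alloc_fails ? NULL : malloc(sizeof(struct toy_node));
}

/*
 * For each page of the dying guest: try to put it on the deferred
 * list; if the allocation fails, destroy (clean up) this one page
 * right away and continue without setting anything aside.
 */
static void toy_set_aside(int npages)
{
	for (int i = 0; i < npages; i++) {
		void *node = toy_alloc_node();

		if (node)
			deferred++;	/* cleaned up later by a thread */
		else
			destroyed++;	/* synchronous fallback */
		free(node);
	}
}
```

[Under memory pressure the allocations fail, so memory use stays flat
at the cost of a slower teardown — the tradeoff described above.]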



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:04     ` Cornelia Huck
@ 2021-05-18 16:19       ` Claudio Imbrenda
  2021-05-18 16:22         ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 16:19 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, borntraeger, frankja, thuth, pasic, david, linux-s390, linux-kernel

On Tue, 18 May 2021 18:04:11 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 18 May 2021 17:36:24 +0200
> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> 
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> > > On Mon, 17 May 2021 22:07:47 +0200
> > > Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> 
> > > > This means that the same address space can have memory
> > > > belonging to more than one protected guest, although only one
> > > > will be running, the others will in fact not even have any
> > > > CPUs.      
> > > 
> > > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > > accessible in any way? I would assume that they only belong to
> > > the    
> > 
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >   
> > > 'zombie' guests, and any new or rebooted guest is a new entity
> > > that needs to get new pages?    
> > 
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)  
> 
> Took a look at those patches, makes sense.
> 
> > 
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.  
> 
> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> actual guest above?
> 

yes this is tricky, because there is the guest OS, which terminates or
reboots, then there is the "secure configuration" entity, handled by the
Ultravisor, and then the KVM VM

when a secure guest reboots, the "secure configuration" is dismantled
(in this case, in a deferred way), and the KVM VM (and its memory) is
not directly affected

what happened before was that the secure configuration was dismantled
synchronously, and then re-created.

now instead, a new secure configuration is created using the same KVM
VM (and thus the same mm), before the old secure configuration has been
completely dismantled. hence the same KVM VM can have multiple secure
configurations associated, sharing the same address space.

of course, only the newest one is actually running, the other ones are
"zombies", without CPUs.



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:13       ` Claudio Imbrenda
@ 2021-05-18 16:20         ` Christian Borntraeger
  2021-05-18 16:34           ` Claudio Imbrenda
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Borntraeger @ 2021-05-18 16:20 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Cornelia Huck, kvm, frankja, thuth, pasic, david, linux-s390,
	linux-kernel



On 18.05.21 18:13, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:45:18 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>> down, its memory was made unprotected, and then the protected VM
>>>>> itself was destroyed. Looping over the whole address space can
>>>>> take some time, considering the overhead of the various
>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>> would take a potentially long amount of time, depending on the
>>>>> amount of used memory.
>>>>>
>>>>> This patchseries implements a deferred destroy mechanism for
>>>>> protected guests. When a protected guest is destroyed, its memory
>>>>> is cleared in background, allowing the guest to restart or
>>>>> terminate significantly faster than before.
>>>>>
>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>> * it still has an address space associated (reboot case)
>>>>> * it does not have an address space anymore (shutdown case)
>>>>>
>>>>> For the reboot case, the reference count of the mm is increased,
>>>>> and then a background thread is started to clean up. Once the
>>>>> thread went through the whole address space, the protected VM is
>>>>> actually destroyed.
>>>>>
>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>> when the mm is torn down. Instead of just unmapping the pages when
>>>>> the address space is being torn down, they are also set aside.
>>>>> Later when KVM cleans up the VM, a thread is started to clean up
>>>>> the pages from the list.
>>>>
>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>
>>> yes
>>>    
>>>>>
>>>>> This means that the same address space can have memory belonging
>>>>> to more than one protected guest, although only one will be
>>>>> running, the others will in fact not even have any CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to the
>>>>   
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>    
>>>> 'zombie' guests, and any new or rebooted guest is a new entity that
>>>> needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>>    
>>>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>>>> exhaustion?
>>>
>>> in case of reboot, not much; the pages were in use are still in use
>>> after the reboot, and they can be swapped.
>>>
>>> in case of a shutdown, yes, because the pages are really taken aside
>>> and cleared/destroyed in background. they cannot be swapped. they
>>> are freed immediately as they are processed, to try to mitigate
>>> memory exhaustion scenarios.
>>>
>>> in the end, this patchseries is a tradeoff between speed and memory
>>> consumption. the memory needs to be cleared up at some point, and
>>> that requires time.
>>>
>>> in cases where this might be an issue, I introduced a new KVM flag
>>> to disable lazy destroy (patch 10)
>>
>> Maybe we could piggy-back on the OOM-kill notifier and then fall back
>> to synchronous freeing for some pages?
> 
> I'm not sure I follow
> 
> once the pages have been set aside, it's too late
> 
> while the pages are being set aside, every now and then some memory
> needs to be allocated. the allocation is atomic, not allowed to use
> emergency reserves, and can fail without warning. if the allocation
> fails, we clean up one page and continue, without setting aside
> anything (patch 9)
> 
> so if the system is low on memory, the lazy destroy should not make the
> situation too much worse.
> 
> the only issue here is starting a normal process in the host (maybe
> a non secure guest) that uses a lot of memory very quickly, right after
> a large secure guest has terminated.

I think page cache page allocations do not need to be atomic.
In that case the kernel might still decide to trigger the OOM killer. We
can let it notify us, free 256 pages synchronously, and avoid the OOM
kill. Have a look at virtio-balloon's virtio_balloon_oom_notify.
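[Editorial aside: the virtio-balloon-style idea — a callback that
synchronously frees a fixed batch of deferred pages when the OOM path
fires — could look roughly like this. This is a userspace sketch with
invented names; the real thing would be registered via
register_oom_notifier() and free pages through the destroy-page UVC:]

```c
#include <assert.h>

#define OOM_FREE_BATCH 256	/* pages to free per notification */

static unsigned long lazy_pages_pending;	/* deferred-destroy backlog */

/*
 * OOM notification callback: synchronously clean up a batch of
 * deferred pages and report how many were freed, so the kernel can
 * retry the allocation instead of OOM-killing something.
 */
static unsigned long toy_oom_notify(void)
{
	unsigned long freed = lazy_pages_pending < OOM_FREE_BATCH ?
			      lazy_pages_pending : OOM_FREE_BATCH;

	lazy_pages_pending -= freed;	/* destroy-page UVC + free here */
	return freed;
}
```

[In the balloon driver the freed count is reported back through the
notifier's parameter; a non-zero count is what lets the OOM killer be
avoided.]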


* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:19       ` Claudio Imbrenda
@ 2021-05-18 16:22         ` David Hildenbrand
  2021-05-18 16:31           ` Claudio Imbrenda
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2021-05-18 16:22 UTC (permalink / raw)
  To: Claudio Imbrenda, Cornelia Huck
  Cc: kvm, borntraeger, frankja, thuth, pasic, linux-s390, linux-kernel

On 18.05.21 18:19, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:04:11 +0200
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Tue, 18 May 2021 17:36:24 +0200
>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>
>>>>> This means that the same address space can have memory
>>>>> belonging to more than one protected guest, although only one
>>>>> will be running, the others will in fact not even have any
>>>>> CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to
>>>> the
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>    
>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>> that needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>
>> Took a look at those patches, makes sense.
>>
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>
>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>> actual guest above?
>>
> 
> yes this is tricky, because there is the guest OS, which terminates or
> reboots, then there is the "secure configuration" entity, handled by the
> Ultravisor, and then the KVM VM
> 
> when a secure guest reboots, the "secure configuration" is dismantled
> (in this case, in a deferred way), and the KVM VM (and its memory) is
> not directly affected
> 
> what happened before was that the secure configuration was dismantled
> synchronously, and then re-created.
> 
> now instead, a new secure configuration is created using the same KVM
> VM (and thus the same mm), before the old secure configuration has been
> completely dismantled. hence the same KVM VM can have multiple secure
> configurations associated, sharing the same address space.
> 
> of course, only the newest one is actually running, the other ones are
> "zombies", without CPUs.
> 

Can a guest trigger a DoS?

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:22         ` David Hildenbrand
@ 2021-05-18 16:31           ` Claudio Imbrenda
  2021-05-18 16:55             ` Christian Borntraeger
  0 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 16:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Cornelia Huck, kvm, borntraeger, frankja, thuth, pasic,
	linux-s390, linux-kernel

On Tue, 18 May 2021 18:22:42 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 18.05.21 18:19, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:04:11 +0200
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Tue, 18 May 2021 17:36:24 +0200
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>  
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> >>  
> >>>>> This means that the same address space can have memory
> >>>>> belonging to more than one protected guest, although only one
> >>>>> will be running, the others will in fact not even have any
> >>>>> CPUs.  
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the  
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>      
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?  
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)  
> >>
> >> Took a look at those patches, makes sense.
> >>  
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.  
> >>
> >> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> >> actual guest above?
> >>  
> > 
> > yes this is tricky, because there is the guest OS, which terminates
> > or reboots, then there is the "secure configuration" entity,
> > handled by the Ultravisor, and then the KVM VM
> > 
> > when a secure guest reboots, the "secure configuration" is
> > dismantled (in this case, in a deferred way), and the KVM VM (and
> > its memory) is not directly affected
> > 
> > what happened before was that the secure configuration was
> > dismantled synchronously, and then re-created.
> > 
> > now instead, a new secure configuration is created using the same
> > KVM VM (and thus the same mm), before the old secure configuration
> > has been completely dismantled. hence the same KVM VM can have
> > multiple secure configurations associated, sharing the same address
> > space.
> > 
> > of course, only the newest one is actually running, the other ones
> > are "zombies", without CPUs.
> >   
> 
> Can a guest trigger a DoS?

I don't see how

a guest can fill its memory and then reboot, and then fill its memory
again and then reboot... but that will take time; filling the memory
will itself clean up leftover pages from previous boots.

"normal" reboot loops will be fast, because there won't be much memory
to process

I have actually tested mixed reboot/shutdown loops, and the system
behaved as you would expect when under load.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:20         ` Christian Borntraeger
@ 2021-05-18 16:34           ` Claudio Imbrenda
  2021-05-18 16:35             ` Christian Borntraeger
  0 siblings, 1 reply; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 16:34 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Cornelia Huck, kvm, frankja, thuth, pasic, david, linux-s390,
	linux-kernel

On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 18:13, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:45:18 +0200
> > Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> >   
> >> On 18.05.21 17:36, Claudio Imbrenda wrote:  
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>>>     
> >>>>> Previously, when a protected VM was rebooted or when it was shut
> >>>>> down, its memory was made unprotected, and then the protected VM
> >>>>> itself was destroyed. Looping over the whole address space can
> >>>>> take some time, considering the overhead of the various
> >>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
> >>>>> would take a potentially long amount of time, depending on the
> >>>>> amount of used memory.
> >>>>>
> >>>>> This patchseries implements a deferred destroy mechanism for
> >>>>> protected guests. When a protected guest is destroyed, its
> >>>>> memory is cleared in background, allowing the guest to restart
> >>>>> or terminate significantly faster than before.
> >>>>>
> >>>>> There are 2 possibilities when a protected VM is torn down:
> >>>>> * it still has an address space associated (reboot case)
> >>>>> * it does not have an address space anymore (shutdown case)
> >>>>>
> >>>>> For the reboot case, the reference count of the mm is increased,
> >>>>> and then a background thread is started to clean up. Once the
> >>>>> thread went through the whole address space, the protected VM is
> >>>>> actually destroyed.
> >>>>>
> >>>>> For the shutdown case, a list of pages to be destroyed is formed
> >>>>> when the mm is torn down. Instead of just unmapping the pages
> >>>>> when the address space is being torn down, they are also set
> >>>>> aside. Later when KVM cleans up the VM, a thread is started to
> >>>>> clean up the pages from the list.  
> >>>>
> >>>> Just to make sure, 'clean up' includes doing uv calls?  
> >>>
> >>> yes
> >>>      
> >>>>>
> >>>>> This means that the same address space can have memory belonging
> >>>>> to more than one protected guest, although only one will be
> >>>>> running, the others will in fact not even have any CPUs.  
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the 
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>      
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?  
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>>      
> >>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
> >>>> memory exhaustion?  
> >>>
> >>> in case of reboot, not much; the pages that were in use are still
> >>> in use after the reboot, and they can be swapped.
> >>>
> >>> in case of a shutdown, yes, because the pages are really taken
> >>> aside and cleared/destroyed in background. they cannot be
> >>> swapped. they are freed immediately as they are processed, to try
> >>> to mitigate memory exhaustion scenarios.
> >>>
> >>> in the end, this patchseries is a tradeoff between speed and
> >>> memory consumption. the memory needs to be cleared up at some
> >>> point, and that requires time.
> >>>
> >>> in cases where this might be an issue, I introduced a new KVM flag
> >>> to disable lazy destroy (patch 10)  
> >>
> >> Maybe we could piggy-back on the OOM-kill notifier and then fall
> >> back to synchronous freeing for some pages?  
> > 
> > I'm not sure I follow
> > 
> > once the pages have been set aside, it's too late
> > 
> > while the pages are being set aside, every now and then some memory
> > needs to be allocated. the allocation is atomic, not allowed to use
> > emergency reserves, and can fail without warning. if the allocation
> > fails, we clean up one page and continue, without setting aside
> > anything (patch 9)
> > 
> > so if the system is low on memory, the lazy destroy should not make
> > the situation too much worse.
> > 
> > the only issue here is starting a normal process in the host (maybe
> > a non secure guest) that uses a lot of memory very quickly, right
> > after a large secure guest has terminated.  
> 
> I think page cache page allocations do not need to be atomic.
> In that case the kernel might still decide to trigger the OOM
> killer. We can let it notify us, free 256 pages synchronously,
> and avoid the OOM kill. Have a look at the virtio-balloon
> virtio_balloon_oom_notify

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we get notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:34           ` Claudio Imbrenda
@ 2021-05-18 16:35             ` Christian Borntraeger
  0 siblings, 0 replies; 34+ messages in thread
From: Christian Borntraeger @ 2021-05-18 16:35 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Cornelia Huck, kvm, frankja, thuth, pasic, david, linux-s390,
	linux-kernel



On 18.05.21 18:34, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:20:22 +0200
> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
>> On 18.05.21 18:13, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:45:18 +0200
>>> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>>    
>>>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>>>      
>>>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>>>> down, its memory was made unprotected, and then the protected VM
>>>>>>> itself was destroyed. Looping over the whole address space can
>>>>>>> take some time, considering the overhead of the various
>>>>>>> Ultravisor Calls (UVCs).  This means that a reboot or a shutdown
>>>>>>> would take a potentially long amount of time, depending on the
>>>>>>> amount of used memory.
>>>>>>>
>>>>>>> This patchseries implements a deferred destroy mechanism for
>>>>>>> protected guests. When a protected guest is destroyed, its
>>>>>>> memory is cleared in background, allowing the guest to restart
>>>>>>> or terminate significantly faster than before.
>>>>>>>
>>>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>>>> * it still has an address space associated (reboot case)
>>>>>>> * it does not have an address space anymore (shutdown case)
>>>>>>>
>>>>>>> For the reboot case, the reference count of the mm is increased,
>>>>>>> and then a background thread is started to clean up. Once the
>>>>>>> thread went through the whole address space, the protected VM is
>>>>>>> actually destroyed.
>>>>>>>
>>>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>>>> when the mm is torn down. Instead of just unmapping the pages
>>>>>>> when the address space is being torn down, they are also set
>>>>>>> aside. Later when KVM cleans up the VM, a thread is started to
>>>>>>> clean up the pages from the list.
>>>>>>
>>>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>>>
>>>>> yes
>>>>>       
>>>>>>>
>>>>>>> This means that the same address space can have memory belonging
>>>>>>> to more than one protected guest, although only one will be
>>>>>>> running, the others will in fact not even have any CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>       
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>>       
>>>>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
>>>>>> memory exhaustion?
>>>>>
>>>>> in case of reboot, not much; the pages that were in use are still
>>>>> in use after the reboot, and they can be swapped.
>>>>>
>>>>> in case of a shutdown, yes, because the pages are really taken
>>>>> aside and cleared/destroyed in background. they cannot be
>>>>> swapped. they are freed immediately as they are processed, to try
>>>>> to mitigate memory exhaustion scenarios.
>>>>>
>>>>> in the end, this patchseries is a tradeoff between speed and
>>>>> memory consumption. the memory needs to be cleared up at some
>>>>> point, and that requires time.
>>>>>
>>>>> in cases where this might be an issue, I introduced a new KVM flag
>>>>> to disable lazy destroy (patch 10)
>>>>
>>>> Maybe we could piggy-back on the OOM-kill notifier and then fall
>>>> back to synchronous freeing for some pages?
>>>
>>> I'm not sure I follow
>>>
>>> once the pages have been set aside, it's too late
>>>
>>> while the pages are being set aside, every now and then some memory
>>> needs to be allocated. the allocation is atomic, not allowed to use
>>> emergency reserves, and can fail without warning. if the allocation
>>> fails, we clean up one page and continue, without setting aside
>>> anything (patch 9)
>>>
>>> so if the system is low on memory, the lazy destroy should not make
>>> the situation too much worse.
>>>
>>> the only issue here is starting a normal process in the host (maybe
>>> a non secure guest) that uses a lot of memory very quickly, right
>>> after a large secure guest has terminated.
>>
>> I think page cache page allocations do not need to be atomic.
>> In that case the kernel might still decide to trigger the OOM
>> killer. We can let it notify us, free 256 pages synchronously,
>> and avoid the OOM kill. Have a look at the virtio-balloon
>> virtio_balloon_oom_notify
> 
> the issue is that once the pages have been set aside, it's too late.
> the OOM notifier would only be useful if we get notified of the OOM
> situation _while_ setting aside the pages.
> 
> unless you mean that the notifier should simply wait until the thread
> has done (some of) its work?

Exactly. Let the notifier wait until you have freed 256 pages, and
return 256 to the OOM notifier.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:31           ` Claudio Imbrenda
@ 2021-05-18 16:55             ` Christian Borntraeger
  2021-05-18 17:00               ` Claudio Imbrenda
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Borntraeger @ 2021-05-18 16:55 UTC (permalink / raw)
  To: Claudio Imbrenda, David Hildenbrand
  Cc: Cornelia Huck, kvm, frankja, thuth, pasic, linux-s390, linux-kernel



On 18.05.21 18:31, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:22:42 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.05.21 18:19, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 18:04:11 +0200
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Tue, 18 May 2021 17:36:24 +0200
>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
>>>>   
>>>>>>> This means that the same address space can have memory
>>>>>>> belonging to more than one protected guest, although only one
>>>>>>> will be running, the others will in fact not even have any
>>>>>>> CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>       
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>
>>>> Took a look at those patches, makes sense.
>>>>   
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>
>>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>>>> actual guest above?
>>>>   
>>>
>>> yes this is tricky, because there is the guest OS, which terminates
>>> or reboots, then there is the "secure configuration" entity,
>>> handled by the Ultravisor, and then the KVM VM
>>>
>>> when a secure guest reboots, the "secure configuration" is
>>> dismantled (in this case, in a deferred way), and the KVM VM (and
>>> its memory) is not directly affected
>>>
>>> what happened before was that the secure configuration was
>>> dismantled synchronously, and then re-created.
>>>
>>> now instead, a new secure configuration is created using the same
>>> KVM VM (and thus the same mm), before the old secure configuration
>>> has been completely dismantled. hence the same KVM VM can have
>>> multiple secure configurations associated, sharing the same address
>>> space.
>>>
>>> of course, only the newest one is actually running, the other ones
>>> are "zombies", without CPUs.
>>>    
>>
>> Can a guest trigger a DoS?
> 
> I don't see how
> 
> a guest can fill its memory and then reboot, and then fill its memory
> again and then reboot... but that will take time, filling the memory
> will itself clean up leftover pages from previous boots.

In essence this guest will then synchronously wait for the page to be
exported and reimported, correct?
> 
> "normal" reboot loops will be fast, because there won't be much memory
> to process
> 
> I have actually tested mixed reboot/shutdown loops, and the system
> behaved as you would expect when under load.

I guess the memory will continue to be accounted to the memcg? Correct?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy
  2021-05-18 16:55             ` Christian Borntraeger
@ 2021-05-18 17:00               ` Claudio Imbrenda
  0 siblings, 0 replies; 34+ messages in thread
From: Claudio Imbrenda @ 2021-05-18 17:00 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: David Hildenbrand, Cornelia Huck, kvm, frankja, thuth, pasic,
	linux-s390, linux-kernel

On Tue, 18 May 2021 18:55:56 +0200
Christian Borntraeger <borntraeger@de.ibm.com> wrote:

> On 18.05.21 18:31, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:22:42 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 18.05.21 18:19, Claudio Imbrenda wrote:  
> >>> On Tue, 18 May 2021 18:04:11 +0200
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Tue, 18 May 2021 17:36:24 +0200
> >>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:
> >>>>     
> >>>>> On Tue, 18 May 2021 17:05:37 +0200
> >>>>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>>>         
> >>>>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>>>> Claudio Imbrenda <imbrenda@linux.ibm.com> wrote:  
> >>>>     
> >>>>>>> This means that the same address space can have memory
> >>>>>>> belonging to more than one protected guest, although only one
> >>>>>>> will be running, the others will in fact not even have any
> >>>>>>> CPUs.  
> >>>>>>
> >>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>>>> accessible in any way? I would assume that they only belong to
> >>>>>> the  
> >>>>>
> >>>>> in case of reboot: yes, they are still in the address space of
> >>>>> the guest, and can be swapped if needed
> >>>>>         
> >>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>>>> that needs to get new pages?  
> >>>>>
> >>>>> the rebooted guest (normal or secure) will re-use the same pages
> >>>>> of the old guest (before or after cleanup, which is the reason
> >>>>> of patches 3 and 4)  
> >>>>
> >>>> Took a look at those patches, makes sense.
> >>>>     
> >>>>>
> >>>>> the KVM guest is not affected in case of reboot, so the
> >>>>> userspace address space is not touched.  
> >>>>
> >>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and
> >>>> the actual guest above?
> >>>>     
> >>>
> >>> yes this is tricky, because there is the guest OS, which
> >>> terminates or reboots, then there is the "secure configuration"
> >>> entity, handled by the Ultravisor, and then the KVM VM
> >>>
> >>> when a secure guest reboots, the "secure configuration" is
> >>> dismantled (in this case, in a deferred way), and the KVM VM (and
> >>> its memory) is not directly affected
> >>>
> >>> what happened before was that the secure configuration was
> >>> dismantled synchronously, and then re-created.
> >>>
> >>> now instead, a new secure configuration is created using the same
> >>> KVM VM (and thus the same mm), before the old secure configuration
> >>> has been completely dismantled. hence the same KVM VM can have
> >>> multiple secure configurations associated, sharing the same
> >>> address space.
> >>>
> >>> of course, only the newest one is actually running, the other ones
> >>> are "zombies", without CPUs.
> >>>      
> >>
> >> Can a guest trigger a DoS?  
> > 
> > I don't see how
> > 
> > a guest can fill its memory and then reboot, and then fill its
> > memory again and then reboot... but that will take time, filling
> > the memory will itself clean up leftover pages from previous boots.
> >  
> 
> In essence this guest will then synchronously wait for the page to be
> exported and reimported, correct?

correct

> > "normal" reboot loops will be fast, because there won't be much
> > memory to process
> > 
> > I have actually tested mixed reboot/shutdown loops, and the system
> > behaved as you would expect when under load.  
> 
> I guess the memory will continue to be accounted to the memcg?
> Correct?

for the reboot case, yes, since the mm is not directly affected.
for the shutdown case, I'm not sure.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 07/11] KVM: s390: pv: add export before import
  2021-05-17 20:07 ` [PATCH v1 07/11] KVM: s390: pv: add export before import Claudio Imbrenda
@ 2021-05-26 11:56   ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-26 11:56 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Due to upcoming changes, it will be possible to temporarily have
> multiple protected VMs in the same address space. When that happens,
> it is necessary to perform an export of every page that is to be
> imported.

... since the Ultravisor doesn't allow a secure page belonging to guest A
to be imported for guest B, in order to guarantee proper guest
isolation.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>

> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  arch/s390/kernel/uv.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index b19b1a1444ec..dbcf4434eb53 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -242,6 +242,12 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
>  	return rc;
>  }
>  
> +static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
> +{
> +	return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
> +		atomic_read(&mm->context.is_protected) > 1;
> +}
> +
>  /*
>   * Requests the Ultravisor to make a page accessible to a guest.
>   * If it's brought in the first time, it will be cleared. If
> @@ -285,6 +291,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
>  
>  	lock_page(page);
>  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> +	if (should_export_before_import(uvcb, gmap->mm))
> +		uv_convert_from_secure(page_to_phys(page));
>  	rc = make_secure_pte(ptep, uaddr, page, uvcb);
>  	pte_unmap_unlock(ptep, ptelock);
>  	unlock_page(page);
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc
  2021-05-17 20:07 ` [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc Claudio Imbrenda
@ 2021-05-26 12:11   ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-26 12:11 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Refactor s390_reset_acc so that its pieces can be reused in upcoming
> patches. The users parameter for s390_destroy_range will be needed in
> upcoming patches.
> 
> We don't want to hold all the locks used in a walk_page_range for too
> long, and the destroy page UVC does take some time to complete.
> Therefore we quickly gather the pages to destroy, and then destroy them
> without holding all the locks.
> 

Acked-by: Janosch Frank <frankja@linux.ibm.com>

> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  arch/s390/include/asm/gmap.h |  5 +-
>  arch/s390/kvm/pv.c           | 12 ++++-
>  arch/s390/mm/gmap.c          | 88 ++++++++++++++++++++++++------------
>  3 files changed, 73 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index 40264f60b0da..618ddc455867 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -147,5 +147,8 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
>  void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
>  			     unsigned long gaddr, unsigned long vmaddr);
>  int gmap_mark_unmergeable(void);
> -void s390_reset_acc(struct mm_struct *mm);
> +void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
> +			   unsigned long start, unsigned long end);
> +void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
> +
>  #endif /* _ASM_S390_GMAP_H */
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index e0532ab725bf..c3f9f30d2ed4 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -12,6 +12,8 @@
>  #include <asm/gmap.h>
>  #include <asm/uv.h>
>  #include <asm/mman.h>
> +#include <linux/pagewalk.h>
> +#include <linux/sched/mm.h>
>  #include "kvm-s390.h"
>  
>  int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> @@ -204,8 +206,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  {
>  	int cc;
>  
> -	/* make all pages accessible before destroying the guest */
> -	s390_reset_acc(kvm->mm);
> +	/*
> +	 * if the mm still has a mapping, make all its pages accessible
> +	 * before destroying the guest
> +	 */
> +	if (mmget_not_zero(kvm->mm)) {
> +		s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
> +		mmput(kvm->mm);
> +	}
>  
>  	cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
>  			   UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index de679facc720..ad210a6e2c41 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2670,41 +2670,71 @@ void s390_reset_cmma(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(s390_reset_cmma);
>  
> -/*
> - * make inaccessible pages accessible again
> - */
> -static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
> -			    unsigned long next, struct mm_walk *walk)
> +#define DESTROY_LOOP_THRESHOLD 32
> +
> +struct reset_walk_state {
> +	unsigned long next;
> +	unsigned long count;
> +	unsigned long pfns[DESTROY_LOOP_THRESHOLD];

Candidate for a module parameter and extensive performance testing?

> +};
> +
> +static int s390_gather_pages(pte_t *ptep, unsigned long addr,
> +			     unsigned long next, struct mm_walk *walk)

A "pv" somewhere in that function name would be helpful to me.
Also the "__" prefix applies here I think. We never call the function
directly and it's static. And I assume that's what the "__" prefix is
trying to tell us. :)

>  {
> +	struct reset_walk_state *p = walk->private;
>  	pte_t pte = READ_ONCE(*ptep);
>  
> -	/* There is a reference through the mapping */
> -	if (pte_present(pte))
> -		WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
> -
> -	return 0;
> +	if (pte_present(pte)) {
> +		/* we have a reference from the mapping, take an extra one */
> +		get_page(phys_to_page(pte_val(pte)));
> +		p->pfns[p->count] = phys_to_pfn(pte_val(pte));
> +		p->next = next;
> +		p->count++;
> +	}
> +	return p->count >= DESTROY_LOOP_THRESHOLD;
>  }
>  
> -static const struct mm_walk_ops reset_acc_walk_ops = {
> -	.pte_entry		= __s390_reset_acc,
> +static const struct mm_walk_ops gather_pages_ops = {
> +	.pte_entry = s390_gather_pages,
>  };
>  
> -#include <linux/sched/mm.h>
> -void s390_reset_acc(struct mm_struct *mm)
> +/*
> + * Call the Destroy secure page UVC on each page in the given array of PFNs.
> + * Each page needs to have an extra reference, which will be released here.
> + */
> +void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
>  {
> -	if (!mm_is_protected(mm))
> -		return;
> -	/*
> -	 * we might be called during
> -	 * reset:                             we walk the pages and clear
> -	 * close of all kvm file descriptors: we walk the pages and clear
> -	 * exit of process on fd closure:     vma already gone, do nothing
> -	 */
> -	if (!mmget_not_zero(mm))
> -		return;
> -	mmap_read_lock(mm);
> -	walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
> -	mmap_read_unlock(mm);
> -	mmput(mm);
> +	unsigned long i;
> +
> +	for (i = 0; i < count; i++) {
> +		/* we always have an extra reference */
> +		uv_destroy_owned_page(pfn_to_phys(pfns[i]));
> +		/* get rid of the extra reference */
> +		put_page(pfn_to_page(pfns[i]));
> +		cond_resched();
> +	}
> +}
> +EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
> +
> +/*
> + * Walk the given range of the given address space, and call the destroy
> + * secure page UVC on each page.
> + * Exit early if the number of users of the mm drops to (or below) the given
> + * value.
> + */
> +void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
> +			   unsigned long start, unsigned long end)
> +{
> +	struct reset_walk_state state = { .next = start };
> +	int r = 1;
> +
> +	while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
> +		state.count = 0;
> +		mmap_read_lock(mm);
> +		r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
> +		mmap_read_unlock(mm);
> +		cond_resched();
> +		s390_uv_destroy_pfns(state.count, state.pfns);
> +	}
>  }
> -EXPORT_SYMBOL_GPL(s390_reset_acc);
> +EXPORT_SYMBOL_GPL(s390_uv_destroy_range);
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag
  2021-05-17 20:07 ` [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag Claudio Imbrenda
@ 2021-05-27  9:29   ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-27  9:29 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Use the is_protected field as a counter instead of a flag. This will
> be used in upcoming patches.
> 
> Increment the counter when a secure configuration is created, and
> decrement it when it is destroyed. Previously the flag was set when the
> set secure parameters UVC was performed.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

Acked-by: Janosch Frank <frankja@linux.ibm.com>

> ---
>  arch/s390/kvm/pv.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index c3f9f30d2ed4..59039b8a7be7 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -218,7 +218,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
>  			   UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
>  	WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
> -	atomic_set(&kvm->mm->context.is_protected, 0);
> +	if (!cc)
> +		atomic_dec(&kvm->mm->context.is_protected);
>  	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
>  	WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
>  	/* Intended memory leak on "impossible" error */
> @@ -259,11 +260,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	/* Outputs */
>  	kvm->arch.pv.handle = uvcb.guest_handle;
>  
> +	atomic_inc(&kvm->mm->context.is_protected);
>  	if (cc) {
> -		if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
> +		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
>  			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> -		else
> +		} else {
> +			atomic_dec(&kvm->mm->context.is_protected);
>  			kvm_s390_pv_dealloc_vm(kvm);
> +		}
>  		return -EIO;
>  	}
>  	kvm->arch.gmap->guest_handle = uvcb.guest_handle;
> @@ -286,8 +290,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
>  	*rrc = uvcb.header.rrc;
>  	KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
>  		     *rc, *rrc);
> -	if (!cc)
> -		atomic_set(&kvm->mm->context.is_protected, 1);
>  	return cc ? -EINVAL : 0;
>  }
>  
> 
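[Editor's note: the flag-to-counter transition in this patch can be sketched in isolation. This is a hypothetical, simplified user-space model — not the kernel's mm_context_t — showing why a counter is needed once two protected configurations can coexist in one address space.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for mm->context.is_protected used as a counter: each live
 * secure configuration holds one reference, so "the mm is protected"
 * becomes "counter != 0" instead of a single on/off flag. This is what
 * lets one address space carry memory belonging to more than one
 * protected guest during a lazy-destroy reboot. */
static atomic_int is_protected;

static void secure_conf_created(void)
{
	atomic_fetch_add(&is_protected, 1);
}

static void secure_conf_destroyed(void)
{
	atomic_fetch_sub(&is_protected, 1);
}

static int mm_is_protected(void)
{
	return atomic_load(&is_protected) != 0;
}
```

With flag semantics, destroying the old guest would have cleared protection while the new guest was still running; the counter keeps the mm protected until the last configuration is gone.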



* Re: [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot
  2021-05-17 20:07 ` [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot Claudio Imbrenda
@ 2021-05-27  9:43   ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-27  9:43 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Until now, destroying a protected guest was an entirely synchronous
> operation that could potentially take a very long time, depending on
> the size of the guest, due to the time needed to clean the protected
> pages out of the address space.
> 
> This patch implements a lazy destroy mechanism that allows a protected
> guest to reboot significantly faster than before.
> 
> This is achieved by clearing the pages of the old guest in the
> background. In case of reboot, the new guest will be able to run in the
> same address space almost immediately.
> 
> The old protected guest is then only destroyed once all of its memory
> has been destroyed or otherwise made non-protected.

LGTM some comments below

> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  arch/s390/kvm/kvm-s390.c |   6 +-
>  arch/s390/kvm/kvm-s390.h |   2 +-
>  arch/s390/kvm/pv.c       | 118 ++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 120 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 2f09e9d7dc95..db25aa1fb6a6 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2248,7 +2248,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
>  
>  		r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
>  		if (r)
> -			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> +			kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);
>  
>  		/* we need to block service interrupts from now on */
>  		set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
> @@ -2267,7 +2267,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
>  		 */
>  		if (r)
>  			break;
> -		r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
> +		r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);
>  
>  		/* no need to block service interrupts any more */
>  		clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
> @@ -2796,7 +2796,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	 * complaining we do not use kvm_s390_pv_is_protected.
>  	 */
>  	if (kvm_s390_pv_get_handle(kvm))
> -		kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
> +		kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
>  	debug_unregister(kvm->arch.dbf);
>  	free_page((unsigned long)kvm->arch.sie_page2);
>  	if (!kvm_is_ucontrol(kvm))
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index 79dcd647b378..b3c0796a3cc1 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
>  /* implemented in pv.c */
>  int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
>  int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
> -int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
> +int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
>  int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
>  int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
>  			      u16 *rrc);
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 59039b8a7be7..9a3547966e18 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -14,8 +14,17 @@
>  #include <asm/mman.h>
>  #include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
> +#include <linux/kthread.h>
>  #include "kvm-s390.h"
>  
> +struct deferred_priv {
> +	struct mm_struct *mm;
> +	unsigned long old_table;
> +	u64 handle;
> +	void *virt;
> +	unsigned long real;
> +};
> +
>  int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
>  {
>  	int cc = 0;
> @@ -202,7 +211,7 @@ static int kvm_s390_pv_replace_asce(struct kvm *kvm)
>  }
>  
>  /* this should not fail, but if it does, we must not free the donated memory */
> -int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> +static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
>  {
>  	int cc;
>  
> @@ -230,6 +239,111 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	return cc ? -EIO : 0;
>  }
>  
> +static int kvm_s390_pv_destroy_vm_thread(void *priv)
> +{
> +	struct deferred_priv *p = priv;
> +	u16 rc, rrc;
> +	int r = 1;
> +
> +	/* Exit early if we end up being the only users of the mm */
> +	s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
> +	mmput(p->mm);

Where do we exit early?

> +
> +	r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
> +	WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
> +	if (r)
> +		return r;
> +	atomic_dec(&p->mm->context.is_protected);
> +
> +	/*
> +	 * Intentional leak in case the destroy secure VM call fails. The
> +	 * call should never fail if the hardware is not broken.
> +	 */
> +	free_pages(p->real, get_order(uv_info.guest_base_stor_len));
> +	free_pages(p->old_table, CRST_ALLOC_ORDER);
> +	vfree(p->virt);
> +	kfree(p);
> +	return 0;
> +}
> +
> +static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
> +{
> +	struct task_struct *t;
> +
> +	priv->virt = kvm->arch.pv.stor_var;
> +	priv->real = kvm->arch.pv.stor_base;

Now I know what the struct members actually mean...
Is there a reason you didn't like config_stor_* as a name or something
similar?

> +	priv->handle = kvm_s390_pv_get_handle(kvm);
> +	priv->old_table = (unsigned long)kvm->arch.gmap->table;
> +	WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
> +
> +	if (kvm_s390_pv_replace_asce(kvm))
> +		goto fail;
> +
> +	t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
> +			   "kvm_s390_pv_destroy_vm_thread");
> +	if (IS_ERR_OR_NULL(t))
> +		goto fail;
> +
> +	memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
> +	KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
> +	wake_up_process(t);
> +	/*
> +	 * no actual UVC is performed at this point, just return a successful
> +	 * rc value to make userspace happy, and an arbitrary rrc
> +	 */
> +	*rc = 1;
> +	*rrc = 42;
> +
> +	return 0;
> +
> +fail:
> +	kfree(priv);
> +	return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +}
> +
> +/*  Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
> +static void kvm_s390_clear_2g(struct kvm *kvm)
> +{
> +	struct kvm_memory_slot *slot;
> +	struct kvm_memslots *slots;
> +	unsigned long lim;
> +	int idx;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +
> +	slots = kvm_memslots(kvm);
> +	kvm_for_each_memslot(slot, slots) {
> +		if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
> +			continue;
> +		if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
> +			lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
> +		else
> +			lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
> +		s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
> +	}
> +
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
> +{
> +	struct deferred_priv *priv;
> +
> +	priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
> +	if (!priv)
> +		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +
> +	if (mmget_not_zero(kvm->mm)) {
> +		kvm_s390_clear_2g(kvm);
> +	} else {
> +		/* No deferred work to do */
> +		kfree(priv);
> +		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +	}
> +	priv->mm = kvm->mm;
> +	return deferred_destroy(kvm, priv, rc, rrc);
> +}
> +
>  int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  {
>  	struct uv_cb_cgc uvcb = {
> @@ -263,7 +377,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	atomic_inc(&kvm->mm->context.is_protected);
>  	if (cc) {
>  		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
> -			kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> +			kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
>  		} else {
>  			atomic_dec(&kvm->mm->context.is_protected);
>  			kvm_s390_pv_dealloc_vm(kvm);
> 
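[Editor's note: the deferred-destroy control flow above — snapshot what teardown needs, hand it to a background thread, report success immediately — can be sketched with pthreads. This is a simplified user-space analogy with hypothetical names, not the kthread-based kernel code.]

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* The foreground path copies everything the teardown needs into a
 * private struct (like struct deferred_priv), launches a worker, and
 * returns success right away; the worker performs the slow per-page
 * cleanup and frees the snapshot itself. */
struct deferred {
	int *mem;      /* stands in for the old guest's donated storage */
	size_t n;
	size_t *done;  /* progress counter, observable by the caller */
};

static void *destroy_worker(void *arg)
{
	struct deferred *p = arg;
	size_t i;

	for (i = 0; i < p->n; i++) {  /* slow per-page teardown */
		p->mem[i] = 0;
		(*p->done)++;
	}
	free(p->mem);                 /* the worker owns the snapshot */
	free(p);
	return NULL;
}

/* Returns 0 ("success") as soon as the worker is launched; on failure
 * the caller would fall back to the synchronous teardown path, just as
 * deferred_destroy() falls back to kvm_s390_pv_deinit_vm_now(). */
static int deinit_deferred(int *mem, size_t n, size_t *done, pthread_t *t)
{
	struct deferred *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->mem = mem;
	p->n = n;
	p->done = done;
	if (pthread_create(t, NULL, destroy_worker, p)) {
		free(p);
		return -1;
	}
	return 0;
}
```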



* Re: [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy
  2021-05-17 20:07 ` [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy Claudio Imbrenda
@ 2021-05-27 10:35   ` Janosch Frank
  0 siblings, 0 replies; 34+ messages in thread
From: Janosch Frank @ 2021-05-27 10:35 UTC (permalink / raw)
  To: Claudio Imbrenda, kvm
  Cc: cohuck, borntraeger, thuth, pasic, david, linux-s390, linux-kernel

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Add the module parameter "lazy_destroy" to allow the lazy destroy
> mechanism to be switched off. This might be useful for debugging
> purposes.
> 
> The parameter is enabled by default.
> 
> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  arch/s390/kvm/pv.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 4333d3e54ef0..00c14406205f 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -26,6 +26,10 @@ struct deferred_priv {
>  	unsigned long real;
>  };
>  
> +static int lazy_destroy = 1;
> +module_param(lazy_destroy, int, 0444);

I'm pondering if we want to make that writable or not.

Reviewed-by: Janosch Frank <frankja@linux.ibm.com>

> +MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
> +
>  int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
>  {
>  	int cc = 0;
> @@ -348,6 +352,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
>  {
>  	struct deferred_priv *priv;
>  
> +	if (!lazy_destroy)
> +		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +
>  	priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
>  	if (!priv)
>  		return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> @@ -396,6 +403,12 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>  	/* Outputs */
>  	kvm->arch.pv.handle = uvcb.guest_handle;
>  
> +	if (!lazy_destroy) {
> +		mmap_write_lock(kvm->mm);
> +		kvm->mm->context.pv_sync_destroy = 1;
> +		mmap_write_unlock(kvm->mm);
> +	}
> +
>  	atomic_inc(&kvm->mm->context.is_protected);
>  	if (cc) {
>  		if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
> 



end of thread, other threads:[~2021-05-27 10:35 UTC | newest]

Thread overview: 34+ messages
-- links below jump to the message on this page --
2021-05-17 20:07 [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Claudio Imbrenda
2021-05-17 20:07 ` [PATCH v1 01/11] KVM: s390: pv: leak the ASCE page when destroy fails Claudio Imbrenda
2021-05-18 10:26   ` Janosch Frank
2021-05-18 10:40     ` Claudio Imbrenda
2021-05-18 12:00       ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 02/11] KVM: s390: pv: properly handle page flags for protected guests Claudio Imbrenda
2021-05-17 20:07 ` [PATCH v1 03/11] KVM: s390: pv: handle secure storage violations " Claudio Imbrenda
2021-05-17 20:07 ` [PATCH v1 04/11] KVM: s390: pv: handle secure storage exceptions for normal guests Claudio Imbrenda
2021-05-17 20:07 ` [PATCH v1 05/11] KVM: s390: pv: refactor s390_reset_acc Claudio Imbrenda
2021-05-26 12:11   ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 06/11] KVM: s390: pv: usage counter instead of flag Claudio Imbrenda
2021-05-27  9:29   ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 07/11] KVM: s390: pv: add export before import Claudio Imbrenda
2021-05-26 11:56   ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot Claudio Imbrenda
2021-05-27  9:43   ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 09/11] KVM: s390: pv: extend lazy destroy to handle shutdown Claudio Imbrenda
2021-05-17 20:07 ` [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy Claudio Imbrenda
2021-05-27 10:35   ` Janosch Frank
2021-05-17 20:07 ` [PATCH v1 11/11] KVM: s390: pv: add support for UV feature bits Claudio Imbrenda
2021-05-18 15:05 ` [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy Cornelia Huck
2021-05-18 15:36   ` Claudio Imbrenda
2021-05-18 15:45     ` Christian Borntraeger
2021-05-18 15:52       ` Cornelia Huck
2021-05-18 16:13       ` Claudio Imbrenda
2021-05-18 16:20         ` Christian Borntraeger
2021-05-18 16:34           ` Claudio Imbrenda
2021-05-18 16:35             ` Christian Borntraeger
2021-05-18 16:04     ` Cornelia Huck
2021-05-18 16:19       ` Claudio Imbrenda
2021-05-18 16:22         ` David Hildenbrand
2021-05-18 16:31           ` Claudio Imbrenda
2021-05-18 16:55             ` Christian Borntraeger
2021-05-18 17:00               ` Claudio Imbrenda
