* [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty logging
From: Jing Zhang @ 2022-01-10 21:04 UTC (permalink / raw)
  To: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe
  Cc: Jing Zhang

This series reduces the performance degradation of the guest workload during
dirty logging on ARM64. A fast path is added to handle permission relaxation
during dirty logging, and the MMU lock is replaced with a rwlock so that all
permission relaxations on leaf PTEs can be performed under the read lock. This
greatly reduces the MMU lock contention during dirty logging. With this
solution, the source guest workload performance degradation is reduced by more
than 60%.

Problem:
  * A Google internal live migration test shows that the source guest workload
  performance has >99% degradation for about 105 seconds, >50% degradation
  for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
  This shows that most of the time, the guest workload degradation is above
  99%, which obviously needs some improvement compared to the test result
  on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
  * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
  * VM spec: #vCPU: 48, #Mem/vCPU: 4GB

Analysis:
  * We enabled CONFIG_LOCK_STAT in the kernel and used dirty_log_perf_test to
    get the number of MMU lock contentions and the "dirty memory time" for
    various VM configurations, using the test command
    ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
    (a sketch of the collection steps is given at the end of this section).
    Below are the results:
    +-------+------------------------+-----------------------+
    | #vCPU | dirty memory time (ms) | number of contentions |
    +-------+------------------------+-----------------------+
    | 1     | 926                    | 0                     |
    +-------+------------------------+-----------------------+
    | 2     | 1189                   | 4732558               |
    +-------+------------------------+-----------------------+
    | 4     | 2503                   | 11527185              |
    +-------+------------------------+-----------------------+
    | 8     | 5069                   | 24881677              |
    +-------+------------------------+-----------------------+
    | 16    | 10340                  | 50347956              |
    +-------+------------------------+-----------------------+
    | 32    | 20351                  | 100605720             |
    +-------+------------------------+-----------------------+
    | 64    | 40994                  | 201442478             |
    +-------+------------------------+-----------------------+

  * From the test results above, the "dirty memory time" and the number of
    MMU lock contentions scale with the number of vCPUs. That means all the
    dirty memory operations from all vCPU threads have been serialized by
    the MMU lock. Further analysis also shows that the permission relaxation
    during dirty logging is where vCPU threads get serialized.
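
  * For reference, here is a minimal sketch of how the per-lock numbers can be
    collected with CONFIG_LOCK_STAT=y (see Documentation/locking/lockstat.rst;
    the exact lock name printed in /proc/lock_stat may differ):
      echo 0 > /proc/lock_stat                # clear accumulated statistics
      ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v 64
      grep -A 3 mmu_lock /proc/lock_stat      # contention counts for the MMU lock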

Solution:
  * On ARM64, there is no mechanism such as PML (Page Modification Logging) and
    the dirty-bit solution for dirty logging is much more complicated compared
    to the write-protection solution. The straightforward way to reduce the
    guest performance degradation is to enhance the concurrency of the
    permission fault path during dirty logging.
  * In this series, we only put leaf PTE permission relaxation for dirty
    logging under the read lock; all other operations still take the write
    lock. Below are the results based on this solution (a standalone
    illustration of the locking split follows the results):
    +-------+------------------------+
    | #vCPU | dirty memory time (ms) |
    +-------+------------------------+
    | 1     | 803                    |
    +-------+------------------------+
    | 2     | 843                    |
    +-------+------------------------+
    | 4     | 942                    |
    +-------+------------------------+
    | 8     | 1458                   |
    +-------+------------------------+
    | 16    | 2853                   |
    +-------+------------------------+
    | 32    | 5886                   |
    +-------+------------------------+
    | 64    | 12190                  |
    +-------+------------------------+
    The "dirty memory time" is reduced by more than 60% as the number of
    vCPUs grows.
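
    For illustration only, below is a minimal standalone sketch (plain
    pthreads, not KVM code) of the locking split described above: the hot
    per-page permission-relaxation path takes the lock shared, while
    heavyweight operations such as write-protecting a whole slot still take
    it exclusively. All names and sizes here are made up for the example.

	/* rwlock_sketch.c - illustration only; build: gcc -O2 -pthread rwlock_sketch.c */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdlib.h>

	#define NPAGES	(1 << 16)
	#define NVCPUS	8
	#define FAULTS	(1 << 20)

	static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
	static atomic_uchar pte_writable[NPAGES];

	/*
	 * Hot path: each "vCPU" thread relaxes write permission on single
	 * pages. Taking the lock shared lets all threads make progress in
	 * parallel; the PTE update itself has to be atomic, as in the real
	 * fast path.
	 */
	static void *vcpu_thread(void *arg)
	{
		unsigned int seed = (unsigned long)arg;
		long i;

		for (i = 0; i < FAULTS; i++) {
			unsigned int gfn = rand_r(&seed) % NPAGES;

			pthread_rwlock_rdlock(&mmu_lock);
			atomic_store(&pte_writable[gfn], 1);
			pthread_rwlock_unlock(&mmu_lock);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NVCPUS];
		long i;
		int g;

		for (i = 0; i < NVCPUS; i++)
			pthread_create(&t[i], NULL, vcpu_thread, (void *)i);

		/*
		 * Heavyweight operation (e.g. write-protecting the whole slot
		 * at the start of a dirty-log iteration) still excludes
		 * everyone.
		 */
		pthread_rwlock_wrlock(&mmu_lock);
		for (g = 0; g < NPAGES; g++)
			atomic_store(&pte_writable[g], 0);
		pthread_rwlock_unlock(&mmu_lock);

		for (i = 0; i < NVCPUS; i++)
			pthread_join(t[i], NULL);
		return 0;
	}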
    
---

Jing Zhang (3):
  KVM: arm64: Use read/write spin lock for MMU protection
  KVM: arm64: Add fast path to handle permission relaxation during dirty
    logging
  KVM: selftests: Add vgic initialization for dirty log perf test for
    ARM

 arch/arm64/include/asm/kvm_host.h             |  2 +
 arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
 3 files changed, 80 insertions(+), 18 deletions(-)


base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
-- 
2.34.1.575.g55b058a8bb-goog


* [RFC PATCH 1/3] KVM: arm64: Use read/write spin lock for MMU protection
From: Jing Zhang @ 2022-01-10 21:04 UTC (permalink / raw)
  To: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe
  Cc: Jing Zhang

To reduce the contentions caused by MMU lock, some MMU operations can
be performed under read lock.
One improvement is to add a fast path for permission relaxation during
dirty logging under the read lock.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
---
 arch/arm64/include/asm/kvm_host.h |  2 ++
 arch/arm64/kvm/mmu.c              | 36 +++++++++++++++----------------
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 3b44ea17af88..6c99c0335bae 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -50,6 +50,8 @@
 #define KVM_DIRTY_LOG_MANUAL_CAPS   (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE | \
 				     KVM_DIRTY_LOG_INITIALLY_SET)
 
+#define KVM_HAVE_MMU_RWLOCK
+
 /*
  * Mode of operation configurable with kvm-arm.mode early param.
  * See Documentation/admin-guide/kernel-parameters.txt for more information.
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index bc2aba953299..cafd5813c949 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -58,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
 			break;
 
 		if (resched && next != end)
-			cond_resched_lock(&kvm->mmu_lock);
+			cond_resched_rwlock_write(&kvm->mmu_lock);
 	} while (addr = next, addr != end);
 
 	return ret;
@@ -179,7 +179,7 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
 	phys_addr_t end = start + size;
 
-	assert_spin_locked(&kvm->mmu_lock);
+	lockdep_assert_held_write(&kvm->mmu_lock);
 	WARN_ON(size & ~PAGE_MASK);
 	WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
 				   may_block));
@@ -213,13 +213,13 @@ static void stage2_flush_vm(struct kvm *kvm)
 	int idx, bkt;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	slots = kvm_memslots(kvm);
 	kvm_for_each_memslot(memslot, bkt, slots)
 		stage2_flush_memslot(kvm, memslot);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -720,13 +720,13 @@ void stage2_unmap_vm(struct kvm *kvm)
 
 	idx = srcu_read_lock(&kvm->srcu);
 	mmap_read_lock(current->mm);
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 
 	slots = kvm_memslots(kvm);
 	kvm_for_each_memslot(memslot, bkt, slots)
 		stage2_unmap_memslot(kvm, memslot);
 
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	mmap_read_unlock(current->mm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
@@ -736,14 +736,14 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
 	struct kvm_pgtable *pgt = NULL;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	pgt = mmu->pgt;
 	if (pgt) {
 		mmu->pgd_phys = 0;
 		mmu->pgt = NULL;
 		free_percpu(mmu->last_vcpu_ran);
 	}
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 
 	if (pgt) {
 		kvm_pgtable_stage2_destroy(pgt);
@@ -783,10 +783,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
 		if (ret)
 			break;
 
-		spin_lock(&kvm->mmu_lock);
+		write_lock(&kvm->mmu_lock);
 		ret = kvm_pgtable_stage2_map(pgt, addr, PAGE_SIZE, pa, prot,
 					     &cache);
-		spin_unlock(&kvm->mmu_lock);
+		write_unlock(&kvm->mmu_lock);
 		if (ret)
 			break;
 
@@ -834,9 +834,9 @@ static void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
 	start = memslot->base_gfn << PAGE_SHIFT;
 	end = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	stage2_wp_range(&kvm->arch.mmu, start, end);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	kvm_flush_remote_tlbs(kvm);
 }
 
@@ -1212,7 +1212,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault && device)
 		return -ENOEXEC;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	pgt = vcpu->arch.hw_mmu->pgt;
 	if (mmu_notifier_retry(kvm, mmu_seq))
 		goto out_unlock;
@@ -1271,7 +1271,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 out_unlock:
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 	kvm_set_pfn_accessed(pfn);
 	kvm_release_pfn_clean(pfn);
 	return ret != -EAGAIN ? ret : 0;
@@ -1286,10 +1286,10 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
 
 	trace_kvm_access_fault(fault_ipa);
 
-	spin_lock(&vcpu->kvm->mmu_lock);
+	write_lock(&vcpu->kvm->mmu_lock);
 	mmu = vcpu->arch.hw_mmu;
 	kpte = kvm_pgtable_stage2_mkyoung(mmu->pgt, fault_ipa);
-	spin_unlock(&vcpu->kvm->mmu_lock);
+	write_unlock(&vcpu->kvm->mmu_lock);
 
 	pte = __pte(kpte);
 	if (pte_valid(pte))
@@ -1692,9 +1692,9 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 	gpa_t gpa = slot->base_gfn << PAGE_SHIFT;
 	phys_addr_t size = slot->npages << PAGE_SHIFT;
 
-	spin_lock(&kvm->mmu_lock);
+	write_lock(&kvm->mmu_lock);
 	unmap_stage2_range(&kvm->arch.mmu, gpa, size);
-	spin_unlock(&kvm->mmu_lock);
+	write_unlock(&kvm->mmu_lock);
 }
 
 /*
-- 
2.34.1.575.g55b058a8bb-goog


* [RFC PATCH 2/3] KVM: arm64: Add fast path to handle permission relaxation during dirty logging
From: Jing Zhang @ 2022-01-10 21:04 UTC (permalink / raw)
  To: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe
  Cc: Jing Zhang

To reduce MMU lock contention during dirty logging, all permission
relaxation operations would be performed under read lock.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
---
 arch/arm64/kvm/mmu.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index cafd5813c949..dd1f43fba4b0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1063,6 +1063,54 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
 	return 0;
 }
 
+static bool fast_mark_writable(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+		struct kvm_memory_slot *memslot, unsigned long fault_status)
+{
+	int ret;
+	bool writable;
+	bool write_fault = kvm_is_write_fault(vcpu);
+	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
+	kvm_pfn_t pfn;
+	struct kvm *kvm = vcpu->kvm;
+	bool logging_active = memslot_is_logging(memslot);
+	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
+	unsigned long fault_granule;
+
+	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
+
+	/* Make sure the fault can be handled in the fast path.
+	 * Only handle write permission fault on non-hugepage during dirty
+	 * logging period.
+	 */
+	if (fault_status != FSC_PERM || fault_granule != PAGE_SIZE
+			|| !logging_active || !write_fault)
+		return false;
+
+
+	/* Pin the pfn to make sure it couldn't be freed and be resued for
+	 * another gfn.
+	 */
+	pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL,
+				   write_fault, &writable, NULL);
+	if (is_error_pfn(pfn) || !writable)
+		return false;
+
+	read_lock(&kvm->mmu_lock);
+	ret = kvm_pgtable_stage2_relax_perms(
+			vcpu->arch.hw_mmu->pgt, fault_ipa, PAGE_HYP);
+
+	if (!ret) {
+		kvm_set_pfn_dirty(pfn);
+		mark_page_dirty_in_slot(kvm, memslot, gfn);
+	}
+	read_unlock(&kvm->mmu_lock);
+
+	kvm_set_pfn_accessed(pfn);
+	kvm_release_pfn_clean(pfn);
+
+	return true;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  unsigned long fault_status)
@@ -1085,6 +1133,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 
+	if (fast_mark_writable(vcpu, fault_ipa, memslot, fault_status))
+		return 0;
 	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
 	write_fault = kvm_is_write_fault(vcpu);
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
-- 
2.34.1.575.g55b058a8bb-goog


* [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
From: Jing Zhang @ 2022-01-10 21:04 UTC (permalink / raw)
  To: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe
  Cc: Jing Zhang

For ARM64, if no vgic is setup before the dirty log perf test, the
userspace irqchip would be used, which would affect the dirty log perf
test result.

Signed-off-by: Jing Zhang <jingzhangos@google.com>
---
 tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index 1954b964d1cf..b501338d9430 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -18,6 +18,12 @@
 #include "test_util.h"
 #include "perf_test_util.h"
 #include "guest_modes.h"
+#ifdef __aarch64__
+#include "aarch64/vgic.h"
+
+#define GICD_BASE_GPA			0x8000000ULL
+#define GICR_BASE_GPA			0x80A0000ULL
+#endif
 
 /* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
 #define TEST_HOST_LOOP_N		2UL
@@ -200,6 +206,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		vm_enable_cap(vm, &cap);
 	}
 
+#ifdef __aarch64__
+	vgic_v3_setup(vm, nr_vcpus, 64, GICD_BASE_GPA, GICR_BASE_GPA);
+#endif
+
 	/* Start the iterations */
 	iteration = 0;
 	host_quit = false;
-- 
2.34.1.575.g55b058a8bb-goog


* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
From: Andrew Jones @ 2022-01-11  9:55 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe

On Mon, Jan 10, 2022 at 09:04:41PM +0000, Jing Zhang wrote:
> For ARM64, if no vgic is setup before the dirty log perf test, the
> userspace irqchip would be used, which would affect the dirty log perf
> test result.
> 
> Signed-off-by: Jing Zhang <jingzhangos@google.com>
> ---
>  tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 1954b964d1cf..b501338d9430 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -18,6 +18,12 @@
>  #include "test_util.h"
>  #include "perf_test_util.h"
>  #include "guest_modes.h"
> +#ifdef __aarch64__
> +#include "aarch64/vgic.h"
> +
> +#define GICD_BASE_GPA			0x8000000ULL
> +#define GICR_BASE_GPA			0x80A0000ULL
> +#endif
>  
>  /* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
>  #define TEST_HOST_LOOP_N		2UL
> @@ -200,6 +206,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  		vm_enable_cap(vm, &cap);
>  	}
>  
> +#ifdef __aarch64__
> +	vgic_v3_setup(vm, nr_vcpus, 64, GICD_BASE_GPA, GICR_BASE_GPA);
                                    ^^ extra parameter

Thanks,
drew

> +#endif
> +
>  	/* Start the iterations */
>  	iteration = 0;
>  	host_quit = false;
> -- 
> 2.34.1.575.g55b058a8bb-goog
> 


* Re: [RFC PATCH 2/3] KVM: arm64: Add fast path to handle permission relaxation during dirty logging
From: Marc Zyngier @ 2022-01-11 10:22 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Mon, 10 Jan 2022 21:04:40 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> To reduce MMU lock contention during dirty logging, all permission
> relaxation operations would be performed under read lock.
> 
> Signed-off-by: Jing Zhang <jingzhangos@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index cafd5813c949..dd1f43fba4b0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1063,6 +1063,54 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>  	return 0;
>  }
>  
> +static bool fast_mark_writable(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> +		struct kvm_memory_slot *memslot, unsigned long fault_status)
> +{
> +	int ret;
> +	bool writable;
> +	bool write_fault = kvm_is_write_fault(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	kvm_pfn_t pfn;
> +	struct kvm *kvm = vcpu->kvm;
> +	bool logging_active = memslot_is_logging(memslot);
> +	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> +	unsigned long fault_granule;
> +
> +	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> +
> +	/* Make sure the fault can be handled in the fast path.
> +	 * Only handle write permission fault on non-hugepage during dirty
> +	 * logging period.
> +	 */

Not the correct comment format.
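
For reference, the documented multi-line comment style
(Documentation/process/coding-style.rst) puts the opening /* on its own
line, e.g.:

	/*
	 * Make sure the fault can be handled in the fast path.
	 * Only handle write permission faults on non-hugepage mappings
	 * during the dirty logging period.
	 */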

> +	if (fault_status != FSC_PERM || fault_granule != PAGE_SIZE
> +			|| !logging_active || !write_fault)
> +		return false;

This is all reinventing the logic that already exists in
user_mem_abort(). I'm sympathetic to the effort not to bloat it even
more, but code duplication doesn't help either.

> +
> +
> +	/* Pin the pfn to make sure it couldn't be freed and be resued for
> +	 * another gfn.
> +	 */
> +	pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL,
> +				   write_fault, &writable, NULL);
> +	if (is_error_pfn(pfn) || !writable)
> +		return false;

What happens if we hit a non-writable mapping? Don't we leak a page
reference?

> +
> +	read_lock(&kvm->mmu_lock);
> +	ret = kvm_pgtable_stage2_relax_perms(
> +			vcpu->arch.hw_mmu->pgt, fault_ipa, PAGE_HYP);

PAGE_HYP? Err... no. KVM_PGTABLE_PROT_RW, more likely. Yes, they
expand to the same thing, but you are not dealing with nVHE EL2 S1
page tables here.

> +
> +	if (!ret) {
> +		kvm_set_pfn_dirty(pfn);
> +		mark_page_dirty_in_slot(kvm, memslot, gfn);
> +	}
> +	read_unlock(&kvm->mmu_lock);
> +
> +	kvm_set_pfn_accessed(pfn);
> +	kvm_release_pfn_clean(pfn);
> +
> +	return true;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1085,6 +1133,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
>  
> +	if (fast_mark_writable(vcpu, fault_ipa, memslot, fault_status))
> +		return 0;
>  	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>  	write_fault = kvm_is_write_fault(vcpu);
>  	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);

You are bypassing all sorts of checks that I want to keep. Please
integrate this in user_mem_abort instead of this side hack.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [RFC PATCH 1/3] KVM: arm64: Use read/write spin lock for MMU protection
From: Marc Zyngier @ 2022-01-11 10:23 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Mon, 10 Jan 2022 21:04:39 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> To reduce the contentions caused by MMU lock, some MMU operations can
> be performed under read lock.
> One improvement is to add a fast path for permission relaxation during
> dirty logging under the read lock.

This commit message really doesn't say what this patch does
(converting our MMU spinlock to a rwlock, and replacing all instances
of the lock being acquired with a write lock acquisition). Crucially,
it only mentions the read lock, which appears *nowhere* in this patch.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
From: Marc Zyngier @ 2022-01-11 10:30 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Mon, 10 Jan 2022 21:04:41 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> For ARM64, if no vgic is setup before the dirty log perf test, the
> userspace irqchip would be used, which would affect the dirty log perf
> test result.

Doesn't it affect *all* performance tests? How much does this change
contribute to the performance numbers you give in the cover letter?

> 
> Signed-off-by: Jing Zhang <jingzhangos@google.com>
> ---
>  tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> index 1954b964d1cf..b501338d9430 100644
> --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> @@ -18,6 +18,12 @@
>  #include "test_util.h"
>  #include "perf_test_util.h"
>  #include "guest_modes.h"
> +#ifdef __aarch64__
> +#include "aarch64/vgic.h"
> +
> +#define GICD_BASE_GPA			0x8000000ULL
> +#define GICR_BASE_GPA			0x80A0000ULL

How did you pick these values?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [RFC PATCH 2/3] KVM: arm64: Add fast path to handle permission relaxation during dirty logging
From: Marc Zyngier @ 2022-01-11 10:50 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

Coming back to this, as it does bother me.

On Mon, 10 Jan 2022 21:04:40 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> To reduce MMU lock contention during dirty logging, all permission
> relaxation operations would be performed under read lock.
> 
> Signed-off-by: Jing Zhang <jingzhangos@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 50 insertions(+)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index cafd5813c949..dd1f43fba4b0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1063,6 +1063,54 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
>  	return 0;
>  }
>  
> +static bool fast_mark_writable(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> +		struct kvm_memory_slot *memslot, unsigned long fault_status)
> +{
> +	int ret;
> +	bool writable;
> +	bool write_fault = kvm_is_write_fault(vcpu);
> +	gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> +	kvm_pfn_t pfn;
> +	struct kvm *kvm = vcpu->kvm;
> +	bool logging_active = memslot_is_logging(memslot);
> +	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> +	unsigned long fault_granule;
> +
> +	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> +
> +	/* Make sure the fault can be handled in the fast path.
> +	 * Only handle write permission fault on non-hugepage during dirty
> +	 * logging period.
> +	 */
> +	if (fault_status != FSC_PERM || fault_granule != PAGE_SIZE
> +			|| !logging_active || !write_fault)
> +		return false;
> +
> +
> +	/* Pin the pfn to make sure it couldn't be freed and be resued for
> +	 * another gfn.
> +	 */
> +	pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL,
> +				   write_fault, &writable, NULL);

Why the requirement to be atomic? Once this returns, the page will
have an elevated refcount, atomic or not. Given that we're not in an
environment that requires atomicity (we're fully preemptible at this
stage), I wonder what this is achieving.

> +	if (is_error_pfn(pfn) || !writable)
> +		return false;
> +
> +	read_lock(&kvm->mmu_lock);

You also dropped the hazarding against a concurrent MMU notifier. Why
is it safe to do so?
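
The hazarding being referred to is the mmu_seq/mmu_notifier_retry() sequence
that user_mem_abort() already performs (sketched from the existing code; the
fast path above grabs the pfn and updates the PTE without it):

	mmu_seq = vcpu->kvm->mmu_notifier_seq;
	smp_rmb();

	pfn = __gfn_to_pfn_memslot(...);
	...
	write_lock(&kvm->mmu_lock);
	if (mmu_notifier_retry(kvm, mmu_seq))
		goto out_unlock;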

> +	ret = kvm_pgtable_stage2_relax_perms(
> +			vcpu->arch.hw_mmu->pgt, fault_ipa, PAGE_HYP);
> +
> +	if (!ret) {
> +		kvm_set_pfn_dirty(pfn);
> +		mark_page_dirty_in_slot(kvm, memslot, gfn);
> +	}
> +	read_unlock(&kvm->mmu_lock);
> +
> +	kvm_set_pfn_accessed(pfn);
> +	kvm_release_pfn_clean(pfn);
> +
> +	return true;
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			  struct kvm_memory_slot *memslot, unsigned long hva,
>  			  unsigned long fault_status)
> @@ -1085,6 +1133,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>  	struct kvm_pgtable *pgt;
>  
> +	if (fast_mark_writable(vcpu, fault_ipa, memslot, fault_status))
> +		return 0;
>  	fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>  	write_fault = kvm_is_write_fault(vcpu);
>  	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

* Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty logging
From: Marc Zyngier @ 2022-01-11 11:54 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Mon, 10 Jan 2022 21:04:38 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with rwlock, by which all
> permission relaxations on leaf PTEs can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be improved
> by more than 60%.
> 
> Problem:
>   * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradtion is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
>   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
>   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB

What are the host and guest page sizes?

> 
> Analysis:
>   * We enabled CONFIG_LOCK_STAT in kernel and used dirty_log_perf_test to get
>     the number of contentions of MMU lock and the "dirty memory time" on
>     various VM spec.
>     By using test command
>     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]

How is this test representative of the internal live migration test
you mention above? '-m 2' indicates a mode that varies depending on
the HW and revision of the test (I just added a bunch of supported
modes). Which one is it?

>     Below are the results:
>     +-------+------------------------+-----------------------+
>     | #vCPU | dirty memory time (ms) | number of contentions |
>     +-------+------------------------+-----------------------+
>     | 1     | 926                    | 0                     |
>     +-------+------------------------+-----------------------+
>     | 2     | 1189                   | 4732558               |
>     +-------+------------------------+-----------------------+
>     | 4     | 2503                   | 11527185              |
>     +-------+------------------------+-----------------------+
>     | 8     | 5069                   | 24881677              |
>     +-------+------------------------+-----------------------+
>     | 16    | 10340                  | 50347956              |
>     +-------+------------------------+-----------------------+
>     | 32    | 20351                  | 100605720             |
>     +-------+------------------------+-----------------------+
>     | 64    | 40994                  | 201442478             |
>     +-------+------------------------+-----------------------+
> 
>   * From the test results above, the "dirty memory time" and the number of
>     MMU lock contention scale with the number of vCPUs. That means all the
>     dirty memory operations from all vCPU threads have been serialized by
>     the MMU lock. Further analysis also shows that the permission relaxation
>     during dirty logging is where vCPU threads get serialized.
> 
> Solution:
>   * On ARM64, there is no mechanism as PML (Page Modification Logging) and
>     the dirty-bit solution for dirty logging is much complicated compared to
>     the write-protection solution. The straight way to reduce the guest
>     performance degradation is to enhance the concurrency for the permission
>     fault path during dirty logging.
>   * In this patch, we only put leaf PTE permission relaxation for dirty
>     logging under read lock, all others would go under write lock.
>     Below are the results based on the solution:
>     +-------+------------------------+
>     | #vCPU | dirty memory time (ms) |
>     +-------+------------------------+
>     | 1     | 803                    |
>     +-------+------------------------+
>     | 2     | 843                    |
>     +-------+------------------------+
>     | 4     | 942                    |
>     +-------+------------------------+
>     | 8     | 1458                   |
>     +-------+------------------------+
>     | 16    | 2853                   |
>     +-------+------------------------+
>     | 32    | 5886                   |
>     +-------+------------------------+
>     | 64    | 12190                  |
>     +-------+------------------------+
>     All "dirty memory time" have been reduced by more than 60% when the
>     number of vCPU grows.

How does that translate to the original problem statement with your
live migration test?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty
  2022-01-11 11:54   ` Marc Zyngier
@ 2022-01-11 22:12     ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-11 22:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Tue, Jan 11, 2022 at 3:55 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 10 Jan 2022 21:04:38 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with rwlock, by which all
> > permision relaxations on leaf pte can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be improved
> > by more than 60%.
> >
> > Problem:
> >   * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradtion is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> >   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> >   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
>
> What are the host and guest page sizes?
Both are 4K, and the guest memory is backed by 2MB hugepages. Will add
the info in future posts.
>
> >
> > Analysis:
> >   * We enabled CONFIG_LOCK_STAT in kernel and used dirty_log_perf_test to get
> >     the number of contentions of MMU lock and the "dirty memory time" on
> >     various VM spec.
> >     By using test command
> >     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>
> How is this test representative of the internal live migration test
> you mention above? '-m 2' indicates a mode that varies depending on
> the HW and revision of the test (I just added a bunch of supported
> modes). Which one is it?
The "dirty memory time" is the time vCPU threads spent in KVM after
fault. Higher "dirty memory time" means higher degradation to guest
workload.
'-m 2' indicates mode "PA-bits:48,  VA-bits:48,  4K pages". Will add
this for future posts.
>
> >     Below are the results:
> >     +-------+------------------------+-----------------------+
> >     | #vCPU | dirty memory time (ms) | number of contentions |
> >     +-------+------------------------+-----------------------+
> >     | 1     | 926                    | 0                     |
> >     +-------+------------------------+-----------------------+
> >     | 2     | 1189                   | 4732558               |
> >     +-------+------------------------+-----------------------+
> >     | 4     | 2503                   | 11527185              |
> >     +-------+------------------------+-----------------------+
> >     | 8     | 5069                   | 24881677              |
> >     +-------+------------------------+-----------------------+
> >     | 16    | 10340                  | 50347956              |
> >     +-------+------------------------+-----------------------+
> >     | 32    | 20351                  | 100605720             |
> >     +-------+------------------------+-----------------------+
> >     | 64    | 40994                  | 201442478             |
> >     +-------+------------------------+-----------------------+
> >
> >   * From the test results above, the "dirty memory time" and the number of
> >     MMU lock contention scale with the number of vCPUs. That means all the
> >     dirty memory operations from all vCPU threads have been serialized by
> >     the MMU lock. Further analysis also shows that the permission relaxation
> >     during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> >   * On ARM64, there is no mechanism as PML (Page Modification Logging) and
> >     the dirty-bit solution for dirty logging is much complicated compared to
> >     the write-protection solution. The straight way to reduce the guest
> >     performance degradation is to enhance the concurrency for the permission
> >     fault path during dirty logging.
> >   * In this patch, we only put leaf PTE permission relaxation for dirty
> >     logging under read lock, all others would go under write lock.
> >     Below are the results based on the solution:
> >     +-------+------------------------+
> >     | #vCPU | dirty memory time (ms) |
> >     +-------+------------------------+
> >     | 1     | 803                    |
> >     +-------+------------------------+
> >     | 2     | 843                    |
> >     +-------+------------------------+
> >     | 4     | 942                    |
> >     +-------+------------------------+
> >     | 8     | 1458                   |
> >     +-------+------------------------+
> >     | 16    | 2853                   |
> >     +-------+------------------------+
> >     | 32    | 5886                   |
> >     +-------+------------------------+
> >     | 64    | 12190                  |
> >     +-------+------------------------+
> >     All "dirty memory time" have been reduced by more than 60% when the
> >     number of vCPU grows.
>
> How does that translate to the original problem statement with your
> live migration test?
With this solution, the Google internal live migration test also shows
more than a 60% improvement: >99% degradation for 30s, >50% for 58s and
>10% for 76s.
Will add this info to future posts.
>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 1/3] KVM: arm64: Use read/write spin lock for MMU protection
  2022-01-11 10:23     ` Marc Zyngier
@ 2022-01-11 22:12       ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-11 22:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

On Tue, Jan 11, 2022 at 2:23 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 10 Jan 2022 21:04:39 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > To reduce the contentions caused by MMU lock, some MMU operations can
> > be performed under read lock.
> > One improvement is to add a fast path for permission relaxation during
> > dirty logging under the read lock.
>
> This commit message really doesn't say what this patch does
> (converting our MMU spinlock to a rwlock, and replacing all instances
> of the lock being acquired with a write lock acquisition). Crucially,
> it only mention the read lock which appears *nowhere* in this patch.
>
Good point. Will use the message below instead for future posts:
"Replace the MMU spinlock with a rwlock and update all instances of the
lock being acquired with a write lock acquisition.
A future commit will add a fast path for permission relaxation during
dirty logging under a read lock."
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 2/3] KVM: arm64: Add fast path to handle permission relaxation during dirty logging
  2022-01-11 10:50     ` Marc Zyngier
@ 2022-01-11 22:12       ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-11 22:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe

Hi Marc,

On Tue, Jan 11, 2022 at 2:50 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Coming back to this, as it does bother me.
>
> On Mon, 10 Jan 2022 21:04:40 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > To reduce MMU lock contention during dirty logging, all permission
> > relaxation operations would be performed under read lock.
> >
> > Signed-off-by: Jing Zhang <jingzhangos@google.com>
> > ---
> >  arch/arm64/kvm/mmu.c | 50 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 50 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index cafd5813c949..dd1f43fba4b0 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1063,6 +1063,54 @@ static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> >       return 0;
> >  }
> >
> > +static bool fast_mark_writable(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +             struct kvm_memory_slot *memslot, unsigned long fault_status)
> > +{
> > +     int ret;
> > +     bool writable;
> > +     bool write_fault = kvm_is_write_fault(vcpu);
> > +     gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> > +     kvm_pfn_t pfn;
> > +     struct kvm *kvm = vcpu->kvm;
> > +     bool logging_active = memslot_is_logging(memslot);
> > +     unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
> > +     unsigned long fault_granule;
> > +
> > +     fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> > +
> > +     /* Make sure the fault can be handled in the fast path.
> > +      * Only handle write permission fault on non-hugepage during dirty
> > +      * logging period.
> > +      */
> > +     if (fault_status != FSC_PERM || fault_granule != PAGE_SIZE
> > +                     || !logging_active || !write_fault)
> > +             return false;
> > +
> > +
> > +     /* Pin the pfn to make sure it couldn't be freed and be resued for
> > +      * another gfn.
> > +      */
> > +     pfn = __gfn_to_pfn_memslot(memslot, gfn, true, NULL,
> > +                                write_fault, &writable, NULL);
>
> Why the requirement to be atomic? Once this returns, the page will
> have an elevated refcount, atomic or not. Given that we're not in an
> environment that requires atomicity (we're fully preemptible at this
> stage), I wonder what this is achieving.
>
> > +     if (is_error_pfn(pfn) || !writable)
> > +             return false;
> > +
> > +     read_lock(&kvm->mmu_lock);
>
> You also dropped the hazarding against a concurrent MMU notifier. Why
> is it safe to do so?
>
> > +     ret = kvm_pgtable_stage2_relax_perms(
> > +                     vcpu->arch.hw_mmu->pgt, fault_ipa, PAGE_HYP);
> > +
> > +     if (!ret) {
> > +             kvm_set_pfn_dirty(pfn);
> > +             mark_page_dirty_in_slot(kvm, memslot, gfn);
> > +     }
> > +     read_unlock(&kvm->mmu_lock);
> > +
> > +     kvm_set_pfn_accessed(pfn);
> > +     kvm_release_pfn_clean(pfn);
> > +
> > +     return true;
> > +}
> > +
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                         struct kvm_memory_slot *memslot, unsigned long hva,
> >                         unsigned long fault_status)
> > @@ -1085,6 +1133,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> >       struct kvm_pgtable *pgt;
> >
> > +     if (fast_mark_writable(vcpu, fault_ipa, memslot, fault_status))
> > +             return 0;
> >       fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> >       write_fault = kvm_is_write_fault(vcpu);
> >       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Appreciate all the comments here. I'll refactor the patch to implement
the fast path in user_mem_abort and address all the problems you
mentioned.
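Roughly what I have in mind (just a sketch; the exact placement and the
mmu_seq handling still need to be worked out in the next version):

	/* in user_mem_abort(), once the pfn has been resolved */
	if (fault_status == FSC_PERM && fault_granule == PAGE_SIZE &&
	    write_fault && logging_active) {
		/* Leaf permission relaxation only: readers can run concurrently */
		read_lock(&kvm->mmu_lock);
		pgt = vcpu->arch.hw_mmu->pgt;
		if (mmu_notifier_retry(kvm, mmu_seq)) {
			read_unlock(&kvm->mmu_lock);
			return 0;	/* let the guest replay the access */
		}
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa,
						     KVM_PGTABLE_PROT_R |
						     KVM_PGTABLE_PROT_W);
		if (!ret) {
			kvm_set_pfn_dirty(pfn);
			mark_page_dirty_in_slot(kvm, memslot, gfn);
		}
		read_unlock(&kvm->mmu_lock);
	} else {
		write_lock(&kvm->mmu_lock);
		/* existing map/relax logic, unchanged */
	}
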
Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
  2022-01-11  9:55     ` Andrew Jones
@ 2022-01-11 22:12       ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-11 22:12 UTC (permalink / raw)
  To: Andrew Jones
  Cc: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe

On Tue, Jan 11, 2022 at 1:55 AM Andrew Jones <drjones@redhat.com> wrote:
>
> On Mon, Jan 10, 2022 at 09:04:41PM +0000, Jing Zhang wrote:
> > For ARM64, if no vgic is setup before the dirty log perf test, the
> > userspace irqchip would be used, which would affect the dirty log perf
> > test result.
> >
> > Signed-off-by: Jing Zhang <jingzhangos@google.com>
> > ---
> >  tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > index 1954b964d1cf..b501338d9430 100644
> > --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > @@ -18,6 +18,12 @@
> >  #include "test_util.h"
> >  #include "perf_test_util.h"
> >  #include "guest_modes.h"
> > +#ifdef __aarch64__
> > +#include "aarch64/vgic.h"
> > +
> > +#define GICD_BASE_GPA                        0x8000000ULL
> > +#define GICR_BASE_GPA                        0x80A0000ULL
> > +#endif
> >
> >  /* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
> >  #define TEST_HOST_LOOP_N             2UL
> > @@ -200,6 +206,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
> >               vm_enable_cap(vm, &cap);
> >       }
> >
> > +#ifdef __aarch64__
> > +     vgic_v3_setup(vm, nr_vcpus, 64, GICD_BASE_GPA, GICR_BASE_GPA);
>                                     ^^ extra parameter
The patch is based on kvm/queue, which has a patch adding the extra
nr_irqs parameter.
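On kvm/queue the helper's prototype reads roughly as below (quoting
from memory, see include/aarch64/vgic.h in the selftests there):

	int vgic_v3_setup(struct kvm_vm *vm, unsigned int nr_vcpus,
			  uint32_t nr_irqs, uint64_t gicd_base_gpa,
			  uint64_t gicr_base_gpa);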

>
> Thanks,
> drew
>
> > +#endif
> > +
> >       /* Start the iterations */
> >       iteration = 0;
> >       host_quit = false;
> > --
> > 2.34.1.575.g55b058a8bb-goog
> >
>

Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
  2022-01-11 10:30     ` Marc Zyngier
@ 2022-01-11 22:16       ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-11 22:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe, Raghavendra Rao Ananta

On Tue, Jan 11, 2022 at 2:30 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Mon, 10 Jan 2022 21:04:41 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > For ARM64, if no vgic is setup before the dirty log perf test, the
> > userspace irqchip would be used, which would affect the dirty log perf
> > test result.
>
> Doesn't it affect *all* performance tests? How much does this change
> contributes to the performance numbers you give in the cover letter?
>
This bottleneck showed up after adding the fast path patch. I haven't
tried the other performance tests with this, but I think it is a good
idea to set up a vgic for all of them. I can post another patch later
to do that for all performance tests, once this one is finished and the
other tests are verified.
Below are the test results without the vgic setup; adding the vgic
setup accounts for a 20~30% improvement across the different vCPU
counts.
    +-------+------------------------+
    | #vCPU | dirty memory time (ms) |
    +-------+------------------------+
    | 1     | 965                    |
    +-------+------------------------+
    | 2     | 1006                   |
    +-------+------------------------+
    | 4     | 1128                   |
    +-------+------------------------+
    | 8     | 2005                   |
    +-------+------------------------+
    | 16    | 3903                   |
    +-------+------------------------+
    | 32    | 7595                   |
    +-------+------------------------+
    | 64    | 15783                  |
    +-------+------------------------+
> >
> > Signed-off-by: Jing Zhang <jingzhangos@google.com>
> > ---
> >  tools/testing/selftests/kvm/dirty_log_perf_test.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > index 1954b964d1cf..b501338d9430 100644
> > --- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > +++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
> > @@ -18,6 +18,12 @@
> >  #include "test_util.h"
> >  #include "perf_test_util.h"
> >  #include "guest_modes.h"
> > +#ifdef __aarch64__
> > +#include "aarch64/vgic.h"
> > +
> > +#define GICD_BASE_GPA                        0x8000000ULL
> > +#define GICR_BASE_GPA                        0x80A0000ULL
>
> How did you pick these values?
I used the same values as the other tests.
I talked with Raghavendra about them; they could be arbitrary, and he
chose these values to match QEMU's configuration.
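FWIW they line up with the QEMU virt machine memory map (sketch, from
memory of hw/arm/virt.c, so double-check before relying on it):

	[VIRT_GIC_DIST]   = { 0x08000000, 0x00010000 },
	[VIRT_GIC_REDIST] = { 0x080A0000, 0x00F60000 },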
>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
  2022-01-11 22:16       ` Jing Zhang
@ 2022-01-12 11:37         ` Marc Zyngier
  -1 siblings, 0 replies; 40+ messages in thread
From: Marc Zyngier @ 2022-01-12 11:37 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe, Raghavendra Rao Ananta

On Tue, 11 Jan 2022 22:16:01 +0000,
Jing Zhang <jingzhangos@google.com> wrote:
> 
> On Tue, Jan 11, 2022 at 2:30 AM Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Mon, 10 Jan 2022 21:04:41 +0000,
> > Jing Zhang <jingzhangos@google.com> wrote:
> > >
> > > For ARM64, if no vgic is setup before the dirty log perf test, the
> > > userspace irqchip would be used, which would affect the dirty log perf
> > > test result.
> >
> > Doesn't it affect *all* performance tests? How much does this change
> > contributes to the performance numbers you give in the cover letter?
> >
> This bottleneck showed up after adding the fast path patch. I haven't
> tried the other performance tests with this, but I think it is a good
> idea to set up a vgic for all of them. I can post another patch later
> to do that for all performance tests, once this one is finished and the
> other tests are verified.
> Below are the test results without the vgic setup; adding the vgic
> setup accounts for a 20~30% improvement across the different vCPU
> counts.
>     +-------+------------------------+
>     | #vCPU | dirty memory time (ms) |
>     +-------+------------------------+
>     | 1     | 965                    |
>     +-------+------------------------+
>     | 2     | 1006                   |
>     +-------+------------------------+
>     | 4     | 1128                   |
>     +-------+------------------------+
>     | 8     | 2005                   |
>     +-------+------------------------+
>     | 16    | 3903                   |
>     +-------+------------------------+
>     | 32    | 7595                   |
>     +-------+------------------------+
>     | 64    | 15783                  |
>     +-------+------------------------+

So please use these numbers in your cover letter when you repost your
series, as the improvement you'd observe on actual workloads is likely
to be less than what you claim due to this change in the test itself
(in other words, if you are going to benchmark something, don't
change the benchmark halfway).

	M.

-- 
Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM
  2022-01-12 11:37         ` Marc Zyngier
@ 2022-01-12 17:40           ` Jing Zhang
  -1 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-12 17:40 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: KVM, KVMARM, Will Deacon, Paolo Bonzini, David Matlack,
	Oliver Upton, Reiji Watanabe, Raghavendra Rao Ananta

On Wed, Jan 12, 2022 at 3:37 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On Tue, 11 Jan 2022 22:16:01 +0000,
> Jing Zhang <jingzhangos@google.com> wrote:
> >
> > On Tue, Jan 11, 2022 at 2:30 AM Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Mon, 10 Jan 2022 21:04:41 +0000,
> > > Jing Zhang <jingzhangos@google.com> wrote:
> > > >
> > > > For ARM64, if no vgic is setup before the dirty log perf test, the
> > > > userspace irqchip would be used, which would affect the dirty log perf
> > > > test result.
> > >
> > > Doesn't it affect *all* performance tests? How much does this change
> > > contributes to the performance numbers you give in the cover letter?
> > >
> > This bottleneck showed up after adding the fast path patch. I haven't
> > tried the other performance tests with this, but I think it is a good
> > idea to set up a vgic for all of them. I can post another patch later
> > to do that for all performance tests, once this one is finished and the
> > other tests are verified.
> > Below are the test results without the vgic setup; adding the vgic
> > setup accounts for a 20~30% improvement across the different vCPU
> > counts.
> >     +-------+------------------------+
> >     | #vCPU | dirty memory time (ms) |
> >     +-------+------------------------+
> >     | 1     | 965                    |
> >     +-------+------------------------+
> >     | 2     | 1006                   |
> >     +-------+------------------------+
> >     | 4     | 1128                   |
> >     +-------+------------------------+
> >     | 8     | 2005                   |
> >     +-------+------------------------+
> >     | 16    | 3903                   |
> >     +-------+------------------------+
> >     | 32    | 7595                   |
> >     +-------+------------------------+
> >     | 64    | 15783                  |
> >     +-------+------------------------+
>
> So please use these numbers in your cover letter when you repost your
> series, as the improvement you'd observe on actual workloads is likely
> to be less than what you claim due to this change in the test itself
> (in other words, if you are going to benchmark something, don't
> change the benchmark halfway).
Sure. Will clarify this in the cover letter in future posts.
Thanks,
Jing
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty
  2022-01-10 21:04 ` Jing Zhang
@ 2022-01-13  2:49   ` Ricardo Koller
  -1 siblings, 0 replies; 40+ messages in thread
From: Ricardo Koller @ 2022-01-13  2:49 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe

Hi Jing,

On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> This patch is to reduce the performance degradation of guest workload during
> dirty logging on ARM64. A fast path is added to handle permission relaxation
> during dirty logging. The MMU lock is replaced with rwlock, by which all
> permision relaxations on leaf pte can be performed under the read lock. This
> greatly reduces the MMU lock contention during dirty logging. With this
> solution, the source guest workload performance degradation can be improved
> by more than 60%.
> 
> Problem:
>   * A Google internal live migration test shows that the source guest workload
>   performance has >99% degradation for about 105 seconds, >50% degradation
>   for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
>   This shows that most of the time, the guest workload degradtion is above
>   99%, which obviously needs some improvement compared to the test result
>   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
>   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
>   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> 
> Analysis:
>   * We enabled CONFIG_LOCK_STAT in kernel and used dirty_log_perf_test to get
>     the number of contentions of MMU lock and the "dirty memory time" on
>     various VM spec.
>     By using test command
>     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
>     Below are the results:
>     +-------+------------------------+-----------------------+
>     | #vCPU | dirty memory time (ms) | number of contentions |
>     +-------+------------------------+-----------------------+
>     | 1     | 926                    | 0                     |
>     +-------+------------------------+-----------------------+
>     | 2     | 1189                   | 4732558               |
>     +-------+------------------------+-----------------------+
>     | 4     | 2503                   | 11527185              |
>     +-------+------------------------+-----------------------+
>     | 8     | 5069                   | 24881677              |
>     +-------+------------------------+-----------------------+
>     | 16    | 10340                  | 50347956              |
>     +-------+------------------------+-----------------------+
>     | 32    | 20351                  | 100605720             |
>     +-------+------------------------+-----------------------+
>     | 64    | 40994                  | 201442478             |
>     +-------+------------------------+-----------------------+
> 
>   * From the test results above, the "dirty memory time" and the number of
>     MMU lock contention scale with the number of vCPUs. That means all the
>     dirty memory operations from all vCPU threads have been serialized by
>     the MMU lock. Further analysis also shows that the permission relaxation
>     during dirty logging is where vCPU threads get serialized.
> 
> Solution:
>   * On ARM64, there is no mechanism as PML (Page Modification Logging) and
>     the dirty-bit solution for dirty logging is much complicated compared to
>     the write-protection solution. The straight way to reduce the guest
>     performance degradation is to enhance the concurrency for the permission
>     fault path during dirty logging.
>   * In this patch, we only put leaf PTE permission relaxation for dirty
>     logging under read lock, all others would go under write lock.
>     Below are the results based on the solution:
>     +-------+------------------------+
>     | #vCPU | dirty memory time (ms) |
>     +-------+------------------------+
>     | 1     | 803                    |
>     +-------+------------------------+
>     | 2     | 843                    |
>     +-------+------------------------+
>     | 4     | 942                    |
>     +-------+------------------------+
>     | 8     | 1458                   |
>     +-------+------------------------+
>     | 16    | 2853                   |
>     +-------+------------------------+
>     | 32    | 5886                   |
>     +-------+------------------------+
>     | 64    | 12190                  |
>     +-------+------------------------+

Just curious, do you know why the time is still (roughly) doubling with
the number of vCPUs? Maybe you performed another experiment or have some
guess(es).

Thanks,
Ricardo

>     All "dirty memory time" have been reduced by more than 60% when the
>     number of vCPU grows.
>     
> ---
> 
> Jing Zhang (3):
>   KVM: arm64: Use read/write spin lock for MMU protection
>   KVM: arm64: Add fast path to handle permission relaxation during dirty
>     logging
>   KVM: selftests: Add vgic initialization for dirty log perf test for
>     ARM
> 
>  arch/arm64/include/asm/kvm_host.h             |  2 +
>  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
>  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
>  3 files changed, 80 insertions(+), 18 deletions(-)
> 
> 
> base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> -- 
> 2.34.1.575.g55b058a8bb-goog
> 
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty
  2022-01-13  2:49   ` Ricardo Koller
@ 2022-01-13  3:50     ` Jing Zhang
  0 siblings, 0 replies; 40+ messages in thread
From: Jing Zhang @ 2022-01-13  3:50 UTC (permalink / raw)
  To: Ricardo Koller
  Cc: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe

On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
>
> Hi Jing,
>
> On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > This patch is to reduce the performance degradation of guest workload during
> > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > during dirty logging. The MMU lock is replaced with rwlock, by which all
> > permision relaxations on leaf pte can be performed under the read lock. This
> > greatly reduces the MMU lock contention during dirty logging. With this
> > solution, the source guest workload performance degradation can be improved
> > by more than 60%.
> >
> > Problem:
> >   * A Google internal live migration test shows that the source guest workload
> >   performance has >99% degradation for about 105 seconds, >50% degradation
> >   for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
> >   This shows that most of the time, the guest workload degradtion is above
> >   99%, which obviously needs some improvement compared to the test result
> >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> >   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> >   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> >
> > Analysis:
> >   * We enabled CONFIG_LOCK_STAT in kernel and used dirty_log_perf_test to get
> >     the number of contentions of MMU lock and the "dirty memory time" on
> >     various VM spec.
> >     By using test command
> >     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> >     Below are the results:
> >     +-------+------------------------+-----------------------+
> >     | #vCPU | dirty memory time (ms) | number of contentions |
> >     +-------+------------------------+-----------------------+
> >     | 1     | 926                    | 0                     |
> >     +-------+------------------------+-----------------------+
> >     | 2     | 1189                   | 4732558               |
> >     +-------+------------------------+-----------------------+
> >     | 4     | 2503                   | 11527185              |
> >     +-------+------------------------+-----------------------+
> >     | 8     | 5069                   | 24881677              |
> >     +-------+------------------------+-----------------------+
> >     | 16    | 10340                  | 50347956              |
> >     +-------+------------------------+-----------------------+
> >     | 32    | 20351                  | 100605720             |
> >     +-------+------------------------+-----------------------+
> >     | 64    | 40994                  | 201442478             |
> >     +-------+------------------------+-----------------------+
> >
> >   * From the test results above, the "dirty memory time" and the number of
> >     MMU lock contention scale with the number of vCPUs. That means all the
> >     dirty memory operations from all vCPU threads have been serialized by
> >     the MMU lock. Further analysis also shows that the permission relaxation
> >     during dirty logging is where vCPU threads get serialized.
> >
> > Solution:
> >   * On ARM64, there is no mechanism as PML (Page Modification Logging) and
> >     the dirty-bit solution for dirty logging is much complicated compared to
> >     the write-protection solution. The straight way to reduce the guest
> >     performance degradation is to enhance the concurrency for the permission
> >     fault path during dirty logging.
> >   * In this patch, we only put leaf PTE permission relaxation for dirty
> >     logging under read lock, all others would go under write lock.
> >     Below are the results based on the solution:
> >     +-------+------------------------+
> >     | #vCPU | dirty memory time (ms) |
> >     +-------+------------------------+
> >     | 1     | 803                    |
> >     +-------+------------------------+
> >     | 2     | 843                    |
> >     +-------+------------------------+
> >     | 4     | 942                    |
> >     +-------+------------------------+
> >     | 8     | 1458                   |
> >     +-------+------------------------+
> >     | 16    | 2853                   |
> >     +-------+------------------------+
> >     | 32    | 5886                   |
> >     +-------+------------------------+
> >     | 64    | 12190                  |
> >     +-------+------------------------+
>
> Just curious, do you know why the time is still (roughly) doubling with
> the number of vCPUs? Maybe you performed another experiment or have some
> guess(es).
Yes, it comes from the serialization caused by the TLB flush that is
issued whenever a permission is relaxed. I tried a test with the TLB
flushes removed (of course they shouldn't be removed), and the time
stayed close to a constant no matter the number of vCPUs.
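To make it concrete, here is a minimal sketch of the fast path (the
relax_perms_fast() wrapper is hypothetical and simplified from the actual
mmu.c change in patch 2/3, but kvm_pgtable_stage2_relax_perms() is the
real pgtable helper, and it is the one issuing the TLBI):

/*
 * The leaf PTE update can proceed concurrently under the read lock, but
 * kvm_pgtable_stage2_relax_perms() finishes by invalidating the TLB entry
 * for the faulting IPA (a broadcast TLBI), and that per-fault invalidation
 * is what still serializes the vCPUs as their count grows.
 */
static int relax_perms_fast(struct kvm *kvm, struct kvm_pgtable *pgt,
                            phys_addr_t fault_ipa, enum kvm_pgtable_prot prot)
{
        int ret;

        read_lock(&kvm->mmu_lock);
        ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
        read_unlock(&kvm->mmu_lock);

        return ret;
}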
>
> Thanks,
> Ricardo
>
> >     All "dirty memory time" have been reduced by more than 60% when the
> >     number of vCPU grows.
> >
> > ---
> >
> > Jing Zhang (3):
> >   KVM: arm64: Use read/write spin lock for MMU protection
> >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> >     logging
> >   KVM: selftests: Add vgic initialization for dirty log perf test for
> >     ARM
> >
> >  arch/arm64/include/asm/kvm_host.h             |  2 +
> >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> >  3 files changed, 80 insertions(+), 18 deletions(-)
> >
> >
> > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > --
> > 2.34.1.575.g55b058a8bb-goog
> >
> > _______________________________________________
> > kvmarm mailing list
> > kvmarm@lists.cs.columbia.edu
> > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Thanks,
Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty
  2022-01-13  3:50     ` Jing Zhang
@ 2022-01-13  6:12       ` Ricardo Koller
  0 siblings, 0 replies; 40+ messages in thread
From: Ricardo Koller @ 2022-01-13  6:12 UTC (permalink / raw)
  To: Jing Zhang
  Cc: KVM, KVMARM, Marc Zyngier, Will Deacon, Paolo Bonzini,
	David Matlack, Oliver Upton, Reiji Watanabe

On Wed, Jan 12, 2022 at 07:50:48PM -0800, Jing Zhang wrote:
> On Wed, Jan 12, 2022 at 6:50 PM Ricardo Koller <ricarkol@google.com> wrote:
> >
> > Hi Jing,
> >
> > On Mon, Jan 10, 2022 at 09:04:38PM +0000, Jing Zhang wrote:
> > > This patch is to reduce the performance degradation of guest workload during
> > > dirty logging on ARM64. A fast path is added to handle permission relaxation
> > > during dirty logging. The MMU lock is replaced with rwlock, by which all
> > > permision relaxations on leaf pte can be performed under the read lock. This
> > > greatly reduces the MMU lock contention during dirty logging. With this
> > > solution, the source guest workload performance degradation can be improved
> > > by more than 60%.
> > >
> > > Problem:
> > >   * A Google internal live migration test shows that the source guest workload
> > >   performance has >99% degradation for about 105 seconds, >50% degradation
> > >   for about 112 seconds, >10% degradation for about 112 seconds on ARM64.
> > >   This shows that most of the time, the guest workload degradtion is above
> > >   99%, which obviously needs some improvement compared to the test result
> > >   on x86 (>99% for 6s, >50% for 9s, >10% for 27s).
> > >   * Tested H/W: Ampere Altra 3GHz, #CPU: 64, #Mem: 256GB
> > >   * VM spec: #vCPU: 48, #Mem/vCPU: 4GB
> > >
> > > Analysis:
> > >   * We enabled CONFIG_LOCK_STAT in kernel and used dirty_log_perf_test to get
> > >     the number of contentions of MMU lock and the "dirty memory time" on
> > >     various VM spec.
> > >     By using test command
> > >     ./dirty_log_perf_test -b 2G -m 2 -i 2 -s anonymous_hugetlb_2mb -v [#vCPU]
> > >     Below are the results:
> > >     +-------+------------------------+-----------------------+
> > >     | #vCPU | dirty memory time (ms) | number of contentions |
> > >     +-------+------------------------+-----------------------+
> > >     | 1     | 926                    | 0                     |
> > >     +-------+------------------------+-----------------------+
> > >     | 2     | 1189                   | 4732558               |
> > >     +-------+------------------------+-----------------------+
> > >     | 4     | 2503                   | 11527185              |
> > >     +-------+------------------------+-----------------------+
> > >     | 8     | 5069                   | 24881677              |
> > >     +-------+------------------------+-----------------------+
> > >     | 16    | 10340                  | 50347956              |
> > >     +-------+------------------------+-----------------------+
> > >     | 32    | 20351                  | 100605720             |
> > >     +-------+------------------------+-----------------------+
> > >     | 64    | 40994                  | 201442478             |
> > >     +-------+------------------------+-----------------------+
> > >
> > >   * From the test results above, the "dirty memory time" and the number of
> > >     MMU lock contention scale with the number of vCPUs. That means all the
> > >     dirty memory operations from all vCPU threads have been serialized by
> > >     the MMU lock. Further analysis also shows that the permission relaxation
> > >     during dirty logging is where vCPU threads get serialized.
> > >
> > > Solution:
> > >   * On ARM64, there is no mechanism as PML (Page Modification Logging) and
> > >     the dirty-bit solution for dirty logging is much complicated compared to
> > >     the write-protection solution. The straight way to reduce the guest
> > >     performance degradation is to enhance the concurrency for the permission
> > >     fault path during dirty logging.
> > >   * In this patch, we only put leaf PTE permission relaxation for dirty
> > >     logging under read lock, all others would go under write lock.
> > >     Below are the results based on the solution:
> > >     +-------+------------------------+
> > >     | #vCPU | dirty memory time (ms) |
> > >     +-------+------------------------+
> > >     | 1     | 803                    |
> > >     +-------+------------------------+
> > >     | 2     | 843                    |
> > >     +-------+------------------------+
> > >     | 4     | 942                    |
> > >     +-------+------------------------+
> > >     | 8     | 1458                   |
> > >     +-------+------------------------+
> > >     | 16    | 2853                   |
> > >     +-------+------------------------+
> > >     | 32    | 5886                   |
> > >     +-------+------------------------+
> > >     | 64    | 12190                  |
> > >     +-------+------------------------+
> >
> > Just curious, do you know why the time is still (roughly) doubling with
> > the number of vCPUs? Maybe you performed another experiment or have some
> > guess(es).
> Yes, it comes from the serialization caused by the TLB flush that is
> issued whenever a permission is relaxed. I tried a test with the TLB
> flushes removed (of course they shouldn't be removed), and the time
> stayed close to a constant no matter the number of vCPUs.

Got it, thanks for the info.

Ricardo

> >
> > Thanks,
> > Ricardo
> >
> > >     All "dirty memory time" have been reduced by more than 60% when the
> > >     number of vCPU grows.
> > >
> > > ---
> > >
> > > Jing Zhang (3):
> > >   KVM: arm64: Use read/write spin lock for MMU protection
> > >   KVM: arm64: Add fast path to handle permission relaxation during dirty
> > >     logging
> > >   KVM: selftests: Add vgic initialization for dirty log perf test for
> > >     ARM
> > >
> > >  arch/arm64/include/asm/kvm_host.h             |  2 +
> > >  arch/arm64/kvm/mmu.c                          | 86 +++++++++++++++----
> > >  .../selftests/kvm/dirty_log_perf_test.c       | 10 +++
> > >  3 files changed, 80 insertions(+), 18 deletions(-)
> > >
> > >
> > > base-commit: fea31d1690945e6dd6c3e89ec5591490857bc3d4
> > > --
> > > 2.34.1.575.g55b058a8bb-goog
> > >
> > > _______________________________________________
> > > kvmarm mailing list
> > > kvmarm@lists.cs.columbia.edu
> > > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
> Thanks,
> Jing

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2022-01-13  6:12 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-10 21:04 [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty Jing Zhang
2022-01-10 21:04 ` Jing Zhang
2022-01-10 21:04 ` [RFC PATCH 1/3] KVM: arm64: Use read/write spin lock for MMU protection Jing Zhang
2022-01-10 21:04   ` Jing Zhang
2022-01-11 10:23   ` Marc Zyngier
2022-01-11 10:23     ` Marc Zyngier
2022-01-11 22:12     ` Jing Zhang
2022-01-11 22:12       ` Jing Zhang
2022-01-10 21:04 ` [RFC PATCH 2/3] KVM: arm64: Add fast path to handle permission relaxation during dirty logging Jing Zhang
2022-01-10 21:04   ` Jing Zhang
2022-01-11 10:22   ` Marc Zyngier
2022-01-11 10:22     ` Marc Zyngier
2022-01-11 10:50   ` Marc Zyngier
2022-01-11 10:50     ` Marc Zyngier
2022-01-11 22:12     ` Jing Zhang
2022-01-11 22:12       ` Jing Zhang
2022-01-10 21:04 ` [RFC PATCH 3/3] KVM: selftests: Add vgic initialization for dirty log perf test for ARM Jing Zhang
2022-01-10 21:04   ` Jing Zhang
2022-01-11  9:55   ` Andrew Jones
2022-01-11  9:55     ` Andrew Jones
2022-01-11 22:12     ` Jing Zhang
2022-01-11 22:12       ` Jing Zhang
2022-01-11 10:30   ` Marc Zyngier
2022-01-11 10:30     ` Marc Zyngier
2022-01-11 22:16     ` Jing Zhang
2022-01-11 22:16       ` Jing Zhang
2022-01-12 11:37       ` Marc Zyngier
2022-01-12 11:37         ` Marc Zyngier
2022-01-12 17:40         ` Jing Zhang
2022-01-12 17:40           ` Jing Zhang
2022-01-11 11:54 ` [RFC PATCH 0/3] ARM64: Guest performance improvement during dirty Marc Zyngier
2022-01-11 11:54   ` Marc Zyngier
2022-01-11 22:12   ` Jing Zhang
2022-01-11 22:12     ` Jing Zhang
2022-01-13  2:49 ` Ricardo Koller
2022-01-13  2:49   ` Ricardo Koller
2022-01-13  3:50   ` Jing Zhang
2022-01-13  3:50     ` Jing Zhang
2022-01-13  6:12     ` Ricardo Koller
2022-01-13  6:12       ` Ricardo Koller
