[Yes, the subject is deliberately provocative!] Having recently played with large, memory intensive, and very short lived VMs, I have realised that we spend a substantial amount of time cleaning up when a VM has terminated. It turns out that 99% of the overhead is due to the unnecessary cache cleaning that we perform when pages get unmapped from Stage-2, which only serves a single purpose: to make the page visible to a non-coherent IO subsystem when the page is being swapped out. It would make sense to drop these cache maintenance operations when userspace is either unmapping a VMA or simply exiting. Detecting the latter is easy, as KVM calls kvm_arch_flush_shadow_all(). The unmap case is harder, or at least it was until 5.2 gained the MMU notifier event, which allows subsystems to find out about the reason of an change in the page tables. And it turns out we can do even better: We can also avoid invalidating individual page mappings if we're tearing down the VM altogether, and replace it with a single TLBI VMALL, which will be much lighter on the whole system (specially if your interconnect is a bit flimsy). With the above, terminating a 64GB VM that has most of its memory mapped at S2 goes from several seconds (I've seen up to 12s) to less than a second on my D05. In general, multi-socket systems seem to benefit from this more than single socket machines (based on a non-representative sample of 4 random machines I have access to). The first patch affects most architectures, as it changes one of the core architecture hooks (in a fairly mechanical way). The rest is definitely ARM-specific. Marc Zyngier (7): KVM: Pass mmu_notifier_range down to kvm_unmap_hva_range() KVM: arm/arm64: Pass flags along Stage-2 unmapping functions KVM: arm/arm64: Condition cache maintenance on unmap with a flag KVM: arm/arm64: Condition TLB maintenance on unmap with a flag KVM: arm/arm64: Elide both CMOs and TBLIs on freeing the whole Stage-2 KVM: arm/arm64: Elide CMOs when retrying a block mapping KVM: arm/arm64: Elide CMOs when unmapping a range arch/arm/include/asm/kvm_host.h | 2 +- arch/arm64/include/asm/kvm_host.h | 2 +- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mmu.c | 6 +- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/book3s.c | 5 +- arch/powerpc/kvm/e500_mmu_host.c | 4 +- arch/x86/include/asm/kvm_host.h | 3 +- arch/x86/kvm/mmu/mmu.c | 5 +- arch/x86/kvm/x86.c | 4 +- include/linux/kvm_host.h | 2 +- virt/kvm/arm/mmu.c | 97 +++++++++++++++++++---------- virt/kvm/kvm_main.c | 7 +-- 13 files changed, 88 insertions(+), 53 deletions(-) -- 2.20.1
kvm_unmap_hva_range() is currently passed both start and end fields from the mmu_notifier_range structure. As this struct now contains important information about the reason of the unmap (the event field), replace the start/end parameters with the range struct, and update all architectures. No functionnal change. Signed-off-by: Marc Zyngier <maz@kernel.org> --- arch/arm/include/asm/kvm_host.h | 2 +- arch/arm64/include/asm/kvm_host.h | 2 +- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mmu.c | 6 ++++-- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/book3s.c | 5 +++-- arch/powerpc/kvm/e500_mmu_host.c | 4 ++-- arch/x86/include/asm/kvm_host.h | 3 ++- arch/x86/kvm/mmu/mmu.c | 5 +++-- arch/x86/kvm/x86.c | 4 ++-- include/linux/kvm_host.h | 2 +- virt/kvm/arm/mmu.c | 8 ++++---- virt/kvm/kvm_main.c | 7 +++---- 13 files changed, 28 insertions(+), 24 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 556cd818eccf..621c71594499 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -276,7 +276,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu); diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index c61260cf63c5..dd850f5e81e3 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -441,7 +441,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 41204a49cf95..0ed065870f1b 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -935,7 +935,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu, #define KVM_ARCH_WANT_MMU_NOTIFIER int kvm_unmap_hva_range(struct kvm *kvm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index 7dad7a293eae..32ef868258b9 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -518,9 +518,11 @@ static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end, return 1; } -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end) +int kvm_unmap_hva_range(struct kvm *kvm, + const struct mmu_notifier_range *range) { - handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL); + handle_hva_to_gpa(kvm, range->start, range->end, + &kvm_unmap_hva_handler, NULL); kvm_mips_callbacks->flush_shadow_all(kvm); return 0; diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 0a398f2321c2..8cef585e0abe 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -58,7 +58,7 @@ #define KVM_ARCH_WANT_MMU_NOTIFIER extern int kvm_unmap_hva_range(struct kvm *kvm, - unsigned long start, unsigned long end); + const struct mmu_notifier_range *range); extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index 58a59ee998e2..a1529a0dd656 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -842,9 +842,10 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm, kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change); } -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end) +int kvm_unmap_hva_range(struct kvm *kvm, + const struct mmu_notifier_range *range) { - return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end); + return kvm->arch.kvm_ops->unmap_hva_range(kvm, range->start, range->end); } int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end) diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 425d13806645..5a7211901063 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -734,10 +734,10 @@ static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva) return 0; } -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end) +int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range) { /* kvm_unmap_hva flushes everything anyways */ - kvm_unmap_hva(kvm, start); + kvm_unmap_hva(kvm, range->start); return 0; } diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b79cd6aa4075..c479fa845d72 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1569,7 +1569,8 @@ asmlinkage void kvm_spurious_fault(void); _ASM_EXTABLE(666b, 667b) #define KVM_ARCH_WANT_MMU_NOTIFIER -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); +int kvm_unmap_hva_range(struct kvm *kvm, + const struct mmu_notifier_range *range); int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_test_age_hva(struct kvm *kvm, unsigned long hva); int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6f92b40d798c..86831be07c17 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2040,9 +2040,10 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva, return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler); } -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end) +int kvm_unmap_hva_range(struct kvm *kvm, + const struct mmu_notifier_range *range); { - return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp); + return kvm_handle_hva_range(kvm, range->start, range->end, 0, kvm_unmap_rmapp); } int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cf917139de6b..c1a238f4ee35 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7942,7 +7942,7 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu) } int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end, + const struct mmu_notifier_range *range, bool blockable) { unsigned long apic_address; @@ -7952,7 +7952,7 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, * Update it when it becomes invalid. */ apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); - if (start <= apic_address && apic_address < end) + if (range->start <= apic_address && apic_address < range->end) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD); return 0; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 7ed1e2f8641e..d6e2ae2accc4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1391,7 +1391,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp, #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end, bool blockable); + const struct mmu_notifier_range *range, bool blockable); #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 38b4c910b6c3..078e10c5650e 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -2035,14 +2035,14 @@ static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *dat return 0; } -int kvm_unmap_hva_range(struct kvm *kvm, - unsigned long start, unsigned long end) +int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range) { if (!kvm->arch.pgd) return 0; - trace_kvm_unmap_hva_range(start, end); - handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL); + trace_kvm_unmap_hva_range(range->start, range->end); + handle_hva_to_gpa(kvm, range->start, range->end, + &kvm_unmap_hva_handler, NULL); return 0; } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 00268290dcbd..7c3665ad1035 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -158,7 +158,7 @@ static unsigned long long kvm_createvm_count; static unsigned long long kvm_active_vms; __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, - unsigned long start, unsigned long end, bool blockable) + const struct mmu_notifier_range *range, bool blockable) { return 0; } @@ -415,7 +415,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * count is also read inside the mmu_lock critical section. */ kvm->mmu_notifier_count++; - need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end); + need_tlb_flush = kvm_unmap_hva_range(kvm, range); need_tlb_flush |= kvm->tlbs_dirty; /* we've to flush the tlb before the pages can be freed */ if (need_tlb_flush) @@ -423,8 +423,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, spin_unlock(&kvm->mmu_lock); - ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start, - range->end, + ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range, mmu_notifier_range_blockable(range)); srcu_read_unlock(&kvm->srcu, idx); -- 2.20.1
Pass a set of flags to all Stage-2 unmapping functions. The only value passed for now is zero, and it is not evaluated yet. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 47 ++++++++++++++++++++++++++-------------------- 1 file changed, 27 insertions(+), 20 deletions(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 078e10c5650e..0fed7c19c6d5 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -152,7 +152,8 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc) return p; } -static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr) +static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr, + unsigned long flags) { pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL); stage2_pgd_clear(kvm, pgd); @@ -161,7 +162,8 @@ static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr put_page(virt_to_page(pgd)); } -static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr) +static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr, + unsigned long flags) { pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0); VM_BUG_ON(stage2_pud_huge(kvm, *pud)); @@ -171,7 +173,8 @@ static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr put_page(virt_to_page(pud)); } -static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr) +static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr, + unsigned long flags) { pte_t *pte_table = pte_offset_kernel(pmd, 0); VM_BUG_ON(pmd_thp_or_huge(*pmd)); @@ -235,7 +238,8 @@ static inline void kvm_pgd_populate(pgd_t *pgdp, pud_t *pudp) * does. */ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd, - phys_addr_t addr, phys_addr_t end) + phys_addr_t addr, phys_addr_t end, + unsigned long flags) { phys_addr_t start_addr = addr; pte_t *pte, *start_pte; @@ -257,11 +261,12 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd, } while (pte++, addr += PAGE_SIZE, addr != end); if (stage2_pte_table_empty(kvm, start_pte)) - clear_stage2_pmd_entry(kvm, pmd, start_addr); + clear_stage2_pmd_entry(kvm, pmd, start_addr, flags); } static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud, - phys_addr_t addr, phys_addr_t end) + phys_addr_t addr, phys_addr_t end, + unsigned long flags) { phys_addr_t next, start_addr = addr; pmd_t *pmd, *start_pmd; @@ -280,17 +285,18 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud, put_page(virt_to_page(pmd)); } else { - unmap_stage2_ptes(kvm, pmd, addr, next); + unmap_stage2_ptes(kvm, pmd, addr, next, flags); } } } while (pmd++, addr = next, addr != end); if (stage2_pmd_table_empty(kvm, start_pmd)) - clear_stage2_pud_entry(kvm, pud, start_addr); + clear_stage2_pud_entry(kvm, pud, start_addr, flags); } static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd, - phys_addr_t addr, phys_addr_t end) + phys_addr_t addr, phys_addr_t end, + unsigned long flags) { phys_addr_t next, start_addr = addr; pud_t *pud, *start_pud; @@ -307,13 +313,13 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd, kvm_flush_dcache_pud(old_pud); put_page(virt_to_page(pud)); } else { - unmap_stage2_pmds(kvm, pud, addr, next); + unmap_stage2_pmds(kvm, pud, addr, next, flags); } } } while (pud++, addr = next, addr != end); if (stage2_pud_table_empty(kvm, start_pud)) - clear_stage2_pgd_entry(kvm, pgd, start_addr); + clear_stage2_pgd_entry(kvm, pgd, start_addr, flags); } /** @@ -327,7 +333,8 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd, * destroying the VM), otherwise another faulting VCPU may come in and mess * with things behind our backs. */ -static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size) +static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size, + unsigned long flags) { pgd_t *pgd; phys_addr_t addr = start, end = start + size; @@ -347,7 +354,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size) break; next = stage2_pgd_addr_end(kvm, addr, end); if (!stage2_pgd_none(kvm, *pgd)) - unmap_stage2_puds(kvm, pgd, addr, next); + unmap_stage2_puds(kvm, pgd, addr, next, flags); /* * If the range is too large, release the kvm->mmu_lock * to prevent starvation and lockup detector warnings. @@ -950,7 +957,7 @@ static void stage2_unmap_memslot(struct kvm *kvm, if (!(vma->vm_flags & VM_PFNMAP)) { gpa_t gpa = addr + (vm_start - memslot->userspace_addr); - unmap_stage2_range(kvm, gpa, vm_end - vm_start); + unmap_stage2_range(kvm, gpa, vm_end - vm_start, 0); } hva = vm_end; } while (hva < reg_end); @@ -996,7 +1003,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm) spin_lock(&kvm->mmu_lock); if (kvm->arch.pgd) { - unmap_stage2_range(kvm, 0, kvm_phys_size(kvm)); + unmap_stage2_range(kvm, 0, kvm_phys_size(kvm), 0); pgd = READ_ONCE(kvm->arch.pgd); kvm->arch.pgd = NULL; kvm->arch.pgd_phys = 0; @@ -1086,7 +1093,7 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache * get handled accordingly. */ if (!pmd_thp_or_huge(old_pmd)) { - unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE); + unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE, 0); goto retry; } /* @@ -1136,7 +1143,7 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac * the range for this block and retry. */ if (!stage2_pud_huge(kvm, old_pud)) { - unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE); + unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE, 0); goto retry; } @@ -2031,7 +2038,7 @@ static int handle_hva_to_gpa(struct kvm *kvm, static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data) { - unmap_stage2_range(kvm, gpa, size); + unmap_stage2_range(kvm, gpa, size, 0); return 0; } @@ -2344,7 +2351,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, spin_lock(&kvm->mmu_lock); if (ret) - unmap_stage2_range(kvm, mem->guest_phys_addr, mem->memory_size); + unmap_stage2_range(kvm, mem->guest_phys_addr, mem->memory_size, 0); else stage2_flush_memslot(kvm, memslot); spin_unlock(&kvm->mmu_lock); @@ -2380,7 +2387,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm, phys_addr_t size = slot->npages << PAGE_SHIFT; spin_lock(&kvm->mmu_lock); - unmap_stage2_range(kvm, gpa, size); + unmap_stage2_range(kvm, gpa, size, 0); spin_unlock(&kvm->mmu_lock); } -- 2.20.1
In order to allow the elision of cache maintenance operations on unmap, add a new flag (KVM_UNMAP_ELIDE_CMO) that a caller can use to indicate that CMOs are not required. Nobody is passing this flag yet, hence no functional change. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 0fed7c19c6d5..ebf8c87cc007 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -35,6 +35,9 @@ static unsigned long io_map_base; #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) +/* Flags controlling S2 unmapping */ +#define KVM_UNMAP_ELIDE_CMO (1UL << 0) + #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0) #define KVM_S2_FLAG_LOGGING_ACTIVE (1UL << 1) @@ -253,7 +256,8 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd, kvm_tlb_flush_vmid_ipa(kvm, addr); /* No need to invalidate the cache for device mappings */ - if (!kvm_is_device_pfn(pte_pfn(old_pte))) + if (!kvm_is_device_pfn(pte_pfn(old_pte)) && + !(flags & KVM_UNMAP_ELIDE_CMO)) kvm_flush_dcache_pte(old_pte); put_page(virt_to_page(pte)); @@ -281,7 +285,8 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud, pmd_clear(pmd); kvm_tlb_flush_vmid_ipa(kvm, addr); - kvm_flush_dcache_pmd(old_pmd); + if (!(flags & KVM_UNMAP_ELIDE_CMO)) + kvm_flush_dcache_pmd(old_pmd); put_page(virt_to_page(pmd)); } else { @@ -310,7 +315,8 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd, stage2_pud_clear(kvm, pud); kvm_tlb_flush_vmid_ipa(kvm, addr); - kvm_flush_dcache_pud(old_pud); + if (!(flags & KVM_UNMAP_ELIDE_CMO)) + kvm_flush_dcache_pud(old_pud); put_page(virt_to_page(pud)); } else { unmap_stage2_pmds(kvm, pud, addr, next, flags); -- 2.20.1
In order to allow the elision of TLB maintenance operations on unmap, add a new flag (KVM_UNMAP_ELIDE_TBLI) that a caller can use to indicate that TLB invalidation is not required. Nobody is passing this flag yet, hence no functional change. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index ebf8c87cc007..4399866842dc 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -37,6 +37,7 @@ static unsigned long io_map_base; /* Flags controlling S2 unmapping */ #define KVM_UNMAP_ELIDE_CMO (1UL << 0) +#define KVM_UNMAP_ELIDE_TBLI (1UL << 1) #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0) #define KVM_S2_FLAG_LOGGING_ACTIVE (1UL << 1) @@ -160,7 +161,8 @@ static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr { pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL); stage2_pgd_clear(kvm, pgd); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); stage2_pud_free(kvm, pud_table); put_page(virt_to_page(pgd)); } @@ -171,7 +173,8 @@ static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0); VM_BUG_ON(stage2_pud_huge(kvm, *pud)); stage2_pud_clear(kvm, pud); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); stage2_pmd_free(kvm, pmd_table); put_page(virt_to_page(pud)); } @@ -182,7 +185,8 @@ static void clear_stage2_pmd_entry(struct kvm *kvm, pmd_t *pmd, phys_addr_t addr pte_t *pte_table = pte_offset_kernel(pmd, 0); VM_BUG_ON(pmd_thp_or_huge(*pmd)); pmd_clear(pmd); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); free_page((unsigned long)pte_table); put_page(virt_to_page(pmd)); } @@ -253,7 +257,8 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd, pte_t old_pte = *pte; kvm_set_pte(pte, __pte(0)); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); /* No need to invalidate the cache for device mappings */ if (!kvm_is_device_pfn(pte_pfn(old_pte)) && @@ -283,7 +288,8 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud, pmd_t old_pmd = *pmd; pmd_clear(pmd); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); if (!(flags & KVM_UNMAP_ELIDE_CMO)) kvm_flush_dcache_pmd(old_pmd); @@ -314,7 +320,8 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd, pud_t old_pud = *pud; stage2_pud_clear(kvm, pud); - kvm_tlb_flush_vmid_ipa(kvm, addr); + if (!(flags & KVM_UNMAP_ELIDE_TBLI)) + kvm_tlb_flush_vmid_ipa(kvm, addr); if (!(flags & KVM_UNMAP_ELIDE_CMO)) kvm_flush_dcache_pud(old_pud); put_page(virt_to_page(pud)); -- 2.20.1
When freeing the whole of a VM's Stage-2 page tables, there is little point in doing cache maintenance on each and every page (the guest won't be running anymore, let alone having its MMU off). As for TLBs, there is no point in invalidating individual pages, as we can replace the whole thing with a VMALL operation, which invalidates all the TLBs for this VM in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index 4399866842dc..d7c710491d26 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -1016,10 +1016,13 @@ void kvm_free_stage2_pgd(struct kvm *kvm) spin_lock(&kvm->mmu_lock); if (kvm->arch.pgd) { - unmap_stage2_range(kvm, 0, kvm_phys_size(kvm), 0); + unmap_stage2_range(kvm, 0, kvm_phys_size(kvm), + KVM_UNMAP_ELIDE_CMO | KVM_UNMAP_ELIDE_TBLI); pgd = READ_ONCE(kvm->arch.pgd); kvm->arch.pgd = NULL; kvm->arch.pgd_phys = 0; + + kvm_flush_remote_tlbs(kvm); } spin_unlock(&kvm->mmu_lock); -- 2.20.1
In the rare cases where we're converting a table mapping in a block mapping, there is no point in cleaning memory to the PoC, as we're about to remap the exact same pages again, only as a block mapping. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index d7c710491d26..c55022dbac89 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -1109,7 +1109,8 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache * get handled accordingly. */ if (!pmd_thp_or_huge(old_pmd)) { - unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE, 0); + unmap_stage2_range(kvm, addr & S2_PMD_MASK, S2_PMD_SIZE, + KVM_UNMAP_ELIDE_CMO); goto retry; } /* @@ -1159,7 +1160,8 @@ static int stage2_set_pud_huge(struct kvm *kvm, struct kvm_mmu_memory_cache *cac * the range for this block and retry. */ if (!stage2_pud_huge(kvm, old_pud)) { - unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE, 0); + unmap_stage2_range(kvm, addr & S2_PUD_MASK, S2_PUD_SIZE, + KVM_UNMAP_ELIDE_CMO); goto retry; } -- 2.20.1
If userspace issues a munmap() on a set of pages, there is no expectation that the pages are cleaned to the PoC. So let's not do more work than strictly necessary, and set the magic flag that avoids CMOs in this case. Signed-off-by: Marc Zyngier <maz@kernel.org> --- virt/kvm/arm/mmu.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index c55022dbac89..6749be33d822 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -2056,7 +2056,13 @@ static int handle_hva_to_gpa(struct kvm *kvm, static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data) { - unmap_stage2_range(kvm, gpa, size, 0); + struct mmu_notifier_range *range = data; + unsigned long flags = 0; + + if (range->event == MMU_NOTIFY_UNMAP) + flags = KVM_UNMAP_ELIDE_CMO; + + unmap_stage2_range(kvm, gpa, size, flags); return 0; } @@ -2067,7 +2073,7 @@ int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range) trace_kvm_unmap_hva_range(range->start, range->end); handle_hva_to_gpa(kvm, range->start, range->end, - &kvm_unmap_hva_handler, NULL); + &kvm_unmap_hva_handler, (void *)range); return 0; } -- 2.20.1
Hi Marc, On 13/12/2019 18:24, Marc Zyngier wrote: > kvm_unmap_hva_range() is currently passed both start and end > fields from the mmu_notifier_range structure. As this struct > now contains important information about the reason of the > unmap (the event field), replace the start/end parameters > with the range struct, and update all architectures. > > No functionnal change. > > Signed-off-by: Marc Zyngier <maz@kernel.org> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 00268290dcbd..7c3665ad1035 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -158,7 +158,7 @@ static unsigned long long kvm_createvm_count; > static unsigned long long kvm_active_vms; > > __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, > - unsigned long start, unsigned long end, bool blockable) > + const struct mmu_notifier_range *range, bool blockable) > { > return 0; > } > @@ -415,7 +415,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > * count is also read inside the mmu_lock critical section. > */ > kvm->mmu_notifier_count++; > - need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end); > + need_tlb_flush = kvm_unmap_hva_range(kvm, range); > need_tlb_flush |= kvm->tlbs_dirty; > /* we've to flush the tlb before the pages can be freed */ > if (need_tlb_flush) > @@ -423,8 +423,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > spin_unlock(&kvm->mmu_lock); > > - ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start, > - range->end, > + ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range, > mmu_notifier_range_blockable(range)); minor nit: Since we now have the range passed on to the arch hooks, we could get rid of the "blockable" too, as it is something you can deduce from the range. Otherwise looks good to me. Suzuki
Hi Suzuki, On Fri, 13 Dec 2019 18:59:32 +0000, Suzuki Kuruppassery Poulose <suzuki.poulose@arm.com> wrote: > > Hi Marc, > > > On 13/12/2019 18:24, Marc Zyngier wrote: > > kvm_unmap_hva_range() is currently passed both start and end > > fields from the mmu_notifier_range structure. As this struct > > now contains important information about the reason of the > > unmap (the event field), replace the start/end parameters > > with the range struct, and update all architectures. > > > > No functionnal change. > > > > Signed-off-by: Marc Zyngier <maz@kernel.org> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index 00268290dcbd..7c3665ad1035 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -158,7 +158,7 @@ static unsigned long long kvm_createvm_count; > > static unsigned long long kvm_active_vms; > > __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm > > *kvm, > > - unsigned long start, unsigned long end, bool blockable) > > + const struct mmu_notifier_range *range, bool blockable) > > { > > return 0; > > } > > @@ -415,7 +415,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > * count is also read inside the mmu_lock critical section. > > */ > > kvm->mmu_notifier_count++; > > - need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end); > > + need_tlb_flush = kvm_unmap_hva_range(kvm, range); > > need_tlb_flush |= kvm->tlbs_dirty; > > /* we've to flush the tlb before the pages can be freed */ > > if (need_tlb_flush) > > @@ -423,8 +423,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > > spin_unlock(&kvm->mmu_lock); > > - ret = kvm_arch_mmu_notifier_invalidate_range(kvm, > > range->start, > > - range->end, > > + ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range, > > mmu_notifier_range_blockable(range)); > > minor nit: > > Since we now have the range passed on to the arch hooks, we could get > rid of the "blockable" too, as it is something you can deduce from the > range. Absolutely. That'd be a nice cleanup. > Otherwise looks good to me. Thanks, M. -- Jazz is not dead, it just smells funny.
On Fri, 13 Dec 2019 18:24:57 +0000
Marc Zyngier <maz@kernel.org> wrote:
> kvm_unmap_hva_range() is currently passed both start and end
> fields from the mmu_notifier_range structure. As this struct
> now contains important information about the reason of the
> unmap (the event field), replace the start/end parameters
> with the range struct, and update all architectures.
>
> No functionnal change.
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
> arch/arm/include/asm/kvm_host.h | 2 +-
> arch/arm64/include/asm/kvm_host.h | 2 +-
> arch/mips/include/asm/kvm_host.h | 2 +-
> arch/mips/kvm/mmu.c | 6 ++++--
> arch/powerpc/include/asm/kvm_host.h | 2 +-
> arch/powerpc/kvm/book3s.c | 5 +++--
> arch/powerpc/kvm/e500_mmu_host.c | 4 ++--
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 5 +++--
> arch/x86/kvm/x86.c | 4 ++--
> include/linux/kvm_host.h | 2 +-
> virt/kvm/arm/mmu.c | 8 ++++----
> virt/kvm/kvm_main.c | 7 +++----
> 13 files changed, 28 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 556cd818eccf..621c71594499 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -276,7 +276,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>
> unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index c61260cf63c5..dd850f5e81e3 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -441,7 +441,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 41204a49cf95..0ed065870f1b 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -935,7 +935,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 7dad7a293eae..32ef868258b9 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -518,9 +518,11 @@ static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> return 1;
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range)
> {
> - handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
> + handle_hva_to_gpa(kvm, range->start, range->end,
> + &kvm_unmap_hva_handler, NULL);
>
> kvm_mips_callbacks->flush_shadow_all(kvm);
> return 0;
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 0a398f2321c2..8cef585e0abe 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -58,7 +58,7 @@
> #define KVM_ARCH_WANT_MMU_NOTIFIER
>
> extern int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 58a59ee998e2..a1529a0dd656 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -842,9 +842,10 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
> kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range)
> {
> - return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
> + return kvm->arch.kvm_ops->unmap_hva_range(kvm, range->start, range->end);
> }
>
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 425d13806645..5a7211901063 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -734,10 +734,10 @@ static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
> return 0;
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range)
> {
> /* kvm_unmap_hva flushes everything anyways */
> - kvm_unmap_hva(kvm, start);
> + kvm_unmap_hva(kvm, range->start);
>
> return 0;
> }
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b79cd6aa4075..c479fa845d72 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1569,7 +1569,8 @@ asmlinkage void kvm_spurious_fault(void);
> _ASM_EXTABLE(666b, 667b)
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end);
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..86831be07c17 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2040,9 +2040,10 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range);
> {
> - return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> + return kvm_handle_hva_range(kvm, range->start, range->end, 0, kvm_unmap_rmapp);
> }
As kindly pointed out by the kbuild robot, this doesn't even compile...
I've fixed it locally, incorporating Suzuki's feedback at the same time.
M.
--
Jazz is not dead. It just smells funny...
Hi Marc, On 13/12/2019 18:25, Marc Zyngier wrote: > If userspace issues a munmap() on a set of pages, there is no > expectation that the pages are cleaned to the PoC. (Pedantry: Clean and invalidate. If the guest wrote through a device mapping, we ditch any clean+stale lines with this path, meaning swapout saves the correct values) > So let's > not do more work than strictly necessary, and set the magic > flag that avoids CMOs in this case. I think this assumes the pages went from anonymous->free, so no-one cares about the contents. If the pages are backed by a file, won't dirty pages will still get written back before the page is free? (e.g. EFI flash 'file' mmap()ed in) What if this isn't the only mapping of the page? Can't it be swapped out from another VMA? (tenuous example, poor man's memory mirroring?) Thanks, James
Hi James, On 2019-12-18 15:07, James Morse wrote: > Hi Marc, > > On 13/12/2019 18:25, Marc Zyngier wrote: >> If userspace issues a munmap() on a set of pages, there is no >> expectation that the pages are cleaned to the PoC. > > (Pedantry: Clean and invalidate. If the guest wrote through a device > mapping, we ditch any clean+stale lines with this path, meaning > swapout > saves the correct values) Indeed. >> So let's >> not do more work than strictly necessary, and set the magic >> flag that avoids CMOs in this case. > > I think this assumes the pages went from anonymous->free, so no-one > cares about the contents. > > If the pages are backed by a file, won't dirty pages will still get > written back before the page is free? (e.g. EFI flash 'file' mmap()ed > in) I believe so. Is that a problem? > What if this isn't the only mapping of the page? Can't it be swapped > out from another VMA? (tenuous example, poor man's memory mirroring?) Swap-out wouldn't trigger this code path, as it would use a different MMU notifier event (MMU_NOTIFY_CLEAR vs MMU_NOTIFY_UNMAP), I believe. Thanks, M. -- Jazz is not dead. It just smells funny...
Hi Marc, On 18/12/2019 15:30, Marc Zyngier wrote: > On 2019-12-18 15:07, James Morse wrote: >> On 13/12/2019 18:25, Marc Zyngier wrote: >>> If userspace issues a munmap() on a set of pages, there is no >>> expectation that the pages are cleaned to the PoC. >>> So let's >>> not do more work than strictly necessary, and set the magic >>> flag that avoids CMOs in this case. >> >> I think this assumes the pages went from anonymous->free, so no-one >> cares about the contents. >> >> If the pages are backed by a file, won't dirty pages will still get >> written back before the page is free? (e.g. EFI flash 'file' mmap()ed in) > > I believe so. Is that a problem? If we skipped the dcache maintenance on unmap, when the the dirty page is later reclaimed the clean+stale lines are written back to the file. File-backed dirty pages will stick around in the page cache in the hope someone else needs them. This would happen for a guest:device-mapping that is written to, but is actually backed by a mmap()d file. I think the EFI flash emulation does exactly this. >> What if this isn't the only mapping of the page? Can't it be swapped >> out from another VMA? (tenuous example, poor man's memory mirroring?) > > Swap-out wouldn't trigger this code path, as it would use a different > MMU notifier event (MMU_NOTIFY_CLEAR vs MMU_NOTIFY_UNMAP), I believe. This was a half thought-through special case of the above. The sequence would be: VM-A and VM-B both share a mapping of $page. 1. VM-A writes to $page through a device mapping 2. The kernel unmaps $page from VM-A for swap. KVM does the maintenance 3. VM-B writes to $page through a device mapping 4. VM-B exits, KVM skips the maintenance, $page may have clean+stale lines 5. Swap finds no further mappings, and writes $page and its clean+stale lines to disk. Two VMs with a shared mapping is the 'easy' example. I think you just need a second mapping for this to happen: it means the page isn't really free after the VM has exited. Thanks, James
On 13/12/19 19:24, Marc Zyngier wrote:
> kvm_unmap_hva_range() is currently passed both start and end
> fields from the mmu_notifier_range structure. As this struct
> now contains important information about the reason of the
> unmap (the event field), replace the start/end parameters
> with the range struct, and update all architectures.
>
> No functionnal change.
>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
> arch/arm/include/asm/kvm_host.h | 2 +-
> arch/arm64/include/asm/kvm_host.h | 2 +-
> arch/mips/include/asm/kvm_host.h | 2 +-
> arch/mips/kvm/mmu.c | 6 ++++--
> arch/powerpc/include/asm/kvm_host.h | 2 +-
> arch/powerpc/kvm/book3s.c | 5 +++--
> arch/powerpc/kvm/e500_mmu_host.c | 4 ++--
> arch/x86/include/asm/kvm_host.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 5 +++--
> arch/x86/kvm/x86.c | 4 ++--
> include/linux/kvm_host.h | 2 +-
> virt/kvm/arm/mmu.c | 8 ++++----
> virt/kvm/kvm_main.c | 7 +++----
> 13 files changed, 28 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 556cd818eccf..621c71594499 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -276,7 +276,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
>
> unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index c61260cf63c5..dd850f5e81e3 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -441,7 +441,7 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
> index 41204a49cf95..0ed065870f1b 100644
> --- a/arch/mips/include/asm/kvm_host.h
> +++ b/arch/mips/include/asm/kvm_host.h
> @@ -935,7 +935,7 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct kvm_vcpu *vcpu,
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
> index 7dad7a293eae..32ef868258b9 100644
> --- a/arch/mips/kvm/mmu.c
> +++ b/arch/mips/kvm/mmu.c
> @@ -518,9 +518,11 @@ static int kvm_unmap_hva_handler(struct kvm *kvm, gfn_t gfn, gfn_t gfn_end,
> return 1;
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range)
> {
> - handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
> + handle_hva_to_gpa(kvm, range->start, range->end,
> + &kvm_unmap_hva_handler, NULL);
>
> kvm_mips_callbacks->flush_shadow_all(kvm);
> return 0;
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 0a398f2321c2..8cef585e0abe 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -58,7 +58,7 @@
> #define KVM_ARCH_WANT_MMU_NOTIFIER
>
> extern int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end);
> + const struct mmu_notifier_range *range);
> extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 58a59ee998e2..a1529a0dd656 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -842,9 +842,10 @@ void kvmppc_core_commit_memory_region(struct kvm *kvm,
> kvm->arch.kvm_ops->commit_memory_region(kvm, mem, old, new, change);
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range)
> {
> - return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
> + return kvm->arch.kvm_ops->unmap_hva_range(kvm, range->start, range->end);
> }
>
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
> diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
> index 425d13806645..5a7211901063 100644
> --- a/arch/powerpc/kvm/e500_mmu_host.c
> +++ b/arch/powerpc/kvm/e500_mmu_host.c
> @@ -734,10 +734,10 @@ static int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
> return 0;
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range)
> {
> /* kvm_unmap_hva flushes everything anyways */
> - kvm_unmap_hva(kvm, start);
> + kvm_unmap_hva(kvm, range->start);
>
> return 0;
> }
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b79cd6aa4075..c479fa845d72 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1569,7 +1569,8 @@ asmlinkage void kvm_spurious_fault(void);
> _ASM_EXTABLE(666b, 667b)
>
> #define KVM_ARCH_WANT_MMU_NOTIFIER
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end);
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range);
> int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
> int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f92b40d798c..86831be07c17 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2040,9 +2040,10 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler);
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm,
> + const struct mmu_notifier_range *range);
> {
> - return kvm_handle_hva_range(kvm, start, end, 0, kvm_unmap_rmapp);
> + return kvm_handle_hva_range(kvm, range->start, range->end, 0, kvm_unmap_rmapp);
> }
>
> int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cf917139de6b..c1a238f4ee35 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7942,7 +7942,7 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> }
>
> int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end,
> + const struct mmu_notifier_range *range,
> bool blockable)
> {
> unsigned long apic_address;
> @@ -7952,7 +7952,7 @@ int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> * Update it when it becomes invalid.
> */
> apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> - if (start <= apic_address && apic_address < end)
> + if (range->start <= apic_address && apic_address < range->end)
> kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>
> return 0;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 7ed1e2f8641e..d6e2ae2accc4 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1391,7 +1391,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
> #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>
> int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end, bool blockable);
> + const struct mmu_notifier_range *range, bool blockable);
>
> #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
> int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 38b4c910b6c3..078e10c5650e 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -2035,14 +2035,14 @@ static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *dat
> return 0;
> }
>
> -int kvm_unmap_hva_range(struct kvm *kvm,
> - unsigned long start, unsigned long end)
> +int kvm_unmap_hva_range(struct kvm *kvm, const struct mmu_notifier_range *range)
> {
> if (!kvm->arch.pgd)
> return 0;
>
> - trace_kvm_unmap_hva_range(start, end);
> - handle_hva_to_gpa(kvm, start, end, &kvm_unmap_hva_handler, NULL);
> + trace_kvm_unmap_hva_range(range->start, range->end);
> + handle_hva_to_gpa(kvm, range->start, range->end,
> + &kvm_unmap_hva_handler, NULL);
> return 0;
> }
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 00268290dcbd..7c3665ad1035 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -158,7 +158,7 @@ static unsigned long long kvm_createvm_count;
> static unsigned long long kvm_active_vms;
>
> __weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end, bool blockable)
> + const struct mmu_notifier_range *range, bool blockable)
> {
> return 0;
> }
> @@ -415,7 +415,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> * count is also read inside the mmu_lock critical section.
> */
> kvm->mmu_notifier_count++;
> - need_tlb_flush = kvm_unmap_hva_range(kvm, range->start, range->end);
> + need_tlb_flush = kvm_unmap_hva_range(kvm, range);
> need_tlb_flush |= kvm->tlbs_dirty;
> /* we've to flush the tlb before the pages can be freed */
> if (need_tlb_flush)
> @@ -423,8 +423,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>
> spin_unlock(&kvm->mmu_lock);
>
> - ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range->start,
> - range->end,
> + ret = kvm_arch_mmu_notifier_invalidate_range(kvm, range,
> mmu_notifier_range_blockable(range));
>
> srcu_read_unlock(&kvm->srcu, idx);
>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>