From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1427053AbdDVA2z (ORCPT);
        Fri, 21 Apr 2017 20:28:55 -0400
Received: from mx2.suse.de ([195.135.220.15]:50283 "EHLO mx1.suse.de"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1161469AbdDVA2t (ORCPT);
        Fri, 21 Apr 2017 20:28:49 -0400
Subject: Re: [PATCH v3] kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd
To: Suzuki K Poulose, Christoffer Dall
References: <1491228763-23450-1-git-send-email-suzuki.poulose@arm.com>
 <20170404101316.GF11752@cbox> <28bece63-9917-5f00-ccdf-fe663def500f@arm.com>
Cc: linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
        ard.biesheuvel@linaro.org, marc.zyngier@arm.com,
        andreyknvl@google.com, will.deacon@arm.com,
        linux-kernel@vger.kernel.org, stable@vger.kernel.org,
        kcc@google.com, syzkaller@googlegroups.com, dvyukov@google.com,
        catalin.marinas@arm.com, pbonzini@redhat.com,
        kvmarm@lists.cs.columbia.edu
From: Alexander Graf
Message-ID:
Date: Sat, 22 Apr 2017 02:28:44 +0200
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:45.0)
 Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <28bece63-9917-5f00-ccdf-fe663def500f@arm.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04.04.17 12:35, Suzuki K Poulose wrote:
> Hi Christoffer,
>
> On 04/04/17 11:13, Christoffer Dall wrote:
>> Hi Suzuki,
>>
>> On Mon, Apr 03, 2017 at 03:12:43PM +0100, Suzuki K Poulose wrote:
>>> In kvm_free_stage2_pgd() we don't hold the kvm->mmu_lock while calling
>>> unmap_stage2_range() on the entire memory range for the guest. This
>>> could cause problems with other callers (e.g., munmap on a memslot)
>>> trying to unmap a range. And since we have to unmap the entire guest
>>> memory range while holding a spinlock, make sure we yield the lock
>>> if necessary, after we unmap each PUD range.
>>>
>>> Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
>>> Cc: stable@vger.kernel.org # v3.10+
>>> Cc: Paolo Bonzini
>>> Cc: Marc Zyngier
>>> Cc: Christoffer Dall
>>> Cc: Mark Rutland
>>> Signed-off-by: Suzuki K Poulose
>>> [ Avoid vCPU starvation and lockup detector warnings ]
>>> Signed-off-by: Marc Zyngier
>>> Signed-off-by: Suzuki K Poulose
>>>
>>
>> This unfortunately fails to build on 32-bit ARM, and I also think we
>> intended to check against S2_PGDIR_SIZE, not S2_PUD_SIZE.
>
> Sorry about that, I didn't test the patch with arm32. I am fine with
> the patch below. And I agree that the name change does make things
> more readable. See below for a hunk that I posted to the kbuild
> report.
>
>> How about adding this to your patch (which includes a rename of
>> S2_PGD_SIZE, which is horribly confusing as it indicates the size of
>> the first-level stage-2 table itself, where S2_PGDIR_SIZE indicates
>> the size of address space mapped by a single entry in the same table):
>>
>> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
>> index 460d616..c997f2d 100644
>> --- a/arch/arm/include/asm/stage2_pgtable.h
>> +++ b/arch/arm/include/asm/stage2_pgtable.h
>> @@ -35,10 +35,13 @@
>>
>>  #define stage2_pud_huge(pud)    pud_huge(pud)
>>
>> +#define S2_PGDIR_SIZE   PGDIR_SIZE
>> +#define S2_PGDIR_MASK   PGDIR_MASK
>> +
>>  /* Open coded p*d_addr_end that can deal with 64bit addresses */
>>  static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr,
>>                                                phys_addr_t end)
>>  {
>> -        phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>> +        phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>>
>>          return (boundary - 1 < end - 1) ? boundary : end;
>>  }
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index db94f3a..6e79a4c 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -41,7 +41,7 @@ static unsigned long hyp_idmap_start;
>>  static unsigned long hyp_idmap_end;
>>  static phys_addr_t hyp_idmap_vector;
>>
>> -#define S2_PGD_SIZE     (PTRS_PER_S2_PGD * sizeof(pgd_t))
>> +#define S2_PGD_TABLE_SIZE       (PTRS_PER_S2_PGD * sizeof(pgd_t))
>>  #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
>>
>>  #define KVM_S2PTE_FLAG_IS_IOMAP (1UL << 0)
>> @@ -299,7 +299,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>           * If the range is too large, release the kvm->mmu_lock
>>           * to prevent starvation and lockup detector warnings.
>>           */
>> -        if (size > S2_PUD_SIZE)
>> +        if (size > S2_PGDIR_SIZE)
>>                  cond_resched_lock(&kvm->mmu_lock);
>>          next = stage2_pgd_addr_end(addr, end);
>>          if (!stage2_pgd_none(*pgd))
>> @@ -747,7 +747,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>>          }
>>
>>          /* Allocate the HW PGD, making sure that each page gets its own refcount */
>> -        pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
>> +        pgd = alloc_pages_exact(S2_PGD_TABLE_SIZE, GFP_KERNEL | __GFP_ZERO);
>>          if (!pgd)
>>                  return -ENOMEM;
>>
>> @@ -843,7 +843,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>>          spin_unlock(&kvm->mmu_lock);
>>
>>          /* Free the HW pgd, one page at a time */
>> -        free_pages_exact(kvm->arch.pgd, S2_PGD_SIZE);
>> +        free_pages_exact(kvm->arch.pgd, S2_PGD_TABLE_SIZE);
>>          kvm->arch.pgd = NULL;
>>  }
>>
>
> Btw, I have a different hunk to solve the problem, which I posted to
> the kbuild report. I will post it here for the sake of capturing the
> discussion in one place. The following hunk, on top of the patch,
> moves the lock release to after we process one PGDIR entry: the first
> time we enter the loop we haven't done much with the lock held, so it
> makes more sense to yield only after the first round, when we have
> already done some work.
>
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index db94f3a..582a972 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -295,15 +295,15 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>          assert_spin_locked(&kvm->mmu_lock);
>          pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>          do {
> +                next = stage2_pgd_addr_end(addr, end);
> +                if (!stage2_pgd_none(*pgd))

Just as a heads-up, I had this version applied to my tree by accident
(commit 8b3405e345b5a098101b0c31b264c812bba045d9 from Christoffer's
queue) and ran into a NULL pointer dereference:

[223090.242280] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[223090.262330] PC is at unmap_stage2_range+0x8c/0x428
[223090.262332] LR is at kvm_unmap_hva_handler+0x2c/0x3c
[223090.262531] Call trace:
[223090.262533] [] unmap_stage2_range+0x8c/0x428
[223090.262535] [] kvm_unmap_hva_handler+0x2c/0x3c
[223090.262537] [] handle_hva_to_gpa+0xb0/0x104
[223090.262539] [] kvm_unmap_hva+0x5c/0xbc
[223090.262543] [] kvm_mmu_notifier_invalidate_page+0x50/0x8c
[223090.262547] [] __mmu_notifier_invalidate_page+0x5c/0x84
[223090.262551] [] try_to_unmap_one+0x1d0/0x4a0
[223090.262553] [] rmap_walk+0x1cc/0x2e0
[223090.262555] [] try_to_unmap+0x74/0xa4
[223090.262557] [] migrate_pages+0x31c/0x5ac
[223090.262561] [] compact_zone+0x3fc/0x7ac
[223090.262563] [] compact_zone_order+0x94/0xb0
[223090.262564] [] try_to_compact_pages+0x108/0x290
[223090.262569] [] __alloc_pages_direct_compact+0x70/0x1ac
[223090.262571] [] __alloc_pages_nodemask+0x434/0x9f4
[223090.262572] [] alloc_pages_vma+0x230/0x254
[223090.262574] [] do_huge_pmd_anonymous_page+0x114/0x538
[223090.262576] [] handle_mm_fault+0xd40/0x17a4
[223090.262577] [] __get_user_pages+0x12c/0x36c
[223090.262578] [] get_user_pages_unlocked+0xa4/0x1b8
[223090.262579] [] __gfn_to_pfn_memslot+0x280/0x31c
[223090.262580] [] gfn_to_pfn_prot+0x4c/0x5c
[223090.262582] [] kvm_handle_guest_abort+0x240/0x774
[223090.262584] [] handle_exit+0x11c/0x1ac
[223090.262586] [] kvm_arch_vcpu_ioctl_run+0x31c/0x648
[223090.262587] [] kvm_vcpu_ioctl+0x378/0x768
[223090.262590] [] do_vfs_ioctl+0x324/0x5a4
[223090.262591] [] SyS_ioctl+0x90/0xa4
[223090.262595] [] el0_svc_naked+0x38/0x3c

0xffff0000080adb78 is in unmap_stage2_range (../arch/arm/kvm/mmu.c:260).
255                pud_t *pud, *start_pud;
256
257                start_pud = pud = stage2_pud_offset(pgd, addr);
258                do {
259                        next = stage2_pud_addr_end(addr, end);
260                        if (!stage2_pud_none(*pud)) {
261                                if (stage2_pud_huge(*pud)) {
262                                        pud_t old_pud = *pud;
263
264                                        stage2_pud_clear(pud);

So please beware that for some reason the pud may become invalid after
rescheduling.
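To make the hazard concrete, here is a minimal sketch of the pgd loop
with a defensive re-check (untested; the READ_ONCE() bail-out and the
unmap_stage2_puds() callee are my assumptions about how one might
harden this, not something either hunk above implements):

        do {
                /*
                 * cond_resched_lock() may drop and re-acquire
                 * kvm->mmu_lock.  While the lock is not held, a
                 * concurrent kvm_free_stage2_pgd() can free the
                 * stage-2 tables and set kvm->arch.pgd to NULL (see
                 * the hunk above), leaving pgd -- and any pud/pmd/pte
                 * pointer computed before the yield -- dangling.
                 */
                if (size > S2_PGDIR_SIZE)
                        cond_resched_lock(&kvm->mmu_lock);

                /* Bail out if the whole table vanished while we slept. */
                if (!READ_ONCE(kvm->arch.pgd))
                        break;

                /* Only trust entries re-read while the lock is held. */
                next = stage2_pgd_addr_end(addr, end);
                if (!stage2_pgd_none(*pgd))
                        unmap_stage2_puds(kvm, pgd, addr, next);
        } while (pgd++, addr = next, addr != end);

Wherever the yield ends up being placed, the point is that nothing
derived from the walk before cond_resched_lock() can be trusted after
it returns, which would explain the dereference above.

Alex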