From mboxrd@z Thu Jan  1 00:00:00 1970
From: Sean Christopherson <seanjc@google.com>
To: Paolo Bonzini
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sean Christopherson, Lai Jiangshan, David Woodhouse,
	Mingwei Zhang
Subject: [PATCH v3 7/8] KVM: Fix multiple races in gfn=>pfn cache refresh
Date: Fri, 29 Apr 2022 21:00:24 +0000
Message-Id: <20220429210025.3293691-8-seanjc@google.com>
In-Reply-To: <20220429210025.3293691-1-seanjc@google.com>
References: <20220429210025.3293691-1-seanjc@google.com>
X-Mailer: git-send-email 2.36.0.464.gb9c8b46e94-goog
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
within the cache itself, and between the cache and mmu_notifier events.

The existing refresh code attempts to guard against races with the
mmu_notifier by speculatively marking the cache valid, and then marking
it invalid if a mmu_notifier invalidation occurs.  That handles the case
where an invalidation occurs between dropping and re-acquiring gpc->lock,
but it doesn't handle the scenario where the cache is refreshed after the
cache was invalidated by the notifier, but before the notifier elevates
mmu_notifier_count.  The gpc refresh can't use the "retry" helper as its
invalidation occurs _before_ mmu_notifier_count is elevated and before
mmu_notifier_range_start is set/updated.

  CPU0                                      CPU1
  ----                                      ----

  gfn_to_pfn_cache_invalidate_start()
  |
  -> gpc->valid = false;
                                            kvm_gfn_to_pfn_cache_refresh()
                                            |
                                            |-> gpc->valid = true;

                                            hva_to_pfn_retry()
                                            |
                                            -> acquire kvm->mmu_lock
                                               kvm->mmu_notifier_count == 0
                                               mmu_seq == kvm->mmu_notifier_seq
                                               drop kvm->mmu_lock
                                               return pfn 'X'
  acquire kvm->mmu_lock
  kvm_inc_notifier_count()
  drop kvm->mmu_lock
  kernel frees pfn 'X'
                                            kvm_gfn_to_pfn_cache_check()
                                            |
                                            |-> gpc->valid == true

                                            caller accesses freed pfn 'X'

Key off of mn_active_invalidate_count to detect that a pfncache refresh
needs to wait for an in-progress mmu_notifier invalidation.  While
mn_active_invalidate_count is not guaranteed to be stable, it is
guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
so either the refresh will see an active invalidation and wait, or the
invalidation will run after the refresh completes.
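In code, the refresh side then boils down to the following (an
illustrative sketch only; the pfn/khva release, the kernel mapping, and
the error handling of the real patch below are elided):

	/* Refresh side; entered with gpc->lock held for write. */
	gpc->valid = false;	/* invalidate before ever dropping the lock */
	for (;;) {
		mmu_seq = kvm->mmu_notifier_seq;
		smp_rmb();

		/* hva=>pfn resolution can sleep, so drop the lock. */
		write_unlock_irq(&gpc->lock);
		new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
		write_lock_irq(&gpc->lock);

		/*
		 * An invalidation that has cleared gpc->valid but hasn't
		 * yet elevated mmu_notifier_count is visible here only as
		 * an elevated mn_active_invalidate_count, so wait it out.
		 */
		if (kvm->mn_active_invalidate_count)
			continue;

		/* Pairs with the smp_wmb() in invalidate_range_end(). */
		smp_rmb();
		if (kvm->mmu_notifier_seq == mmu_seq)
			break;
	}
	gpc->valid = true;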
Speculatively marking the cache valid is itself flawed, as a concurrent
kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
values.  The KVM Xen use case explicitly allows/wants multiple users;
even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
can read a different vCPU (or vCPUs).  Address this race by invalidating
the cache prior to dropping gpc->lock (this is made possible by fixing
the above mmu_notifier race).

Complicating all of this is the fact that both the hva=>pfn resolution
and the mapping of the kernel address can sleep, i.e. must be done
outside of gpc->lock.

Fix the above races in one fell swoop; trying to fix each individual
race is largely pointless and essentially impossible to test, e.g.
closing one hole just shifts the focus to the other hole.

Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support")
Cc: stable@vger.kernel.org
Cc: David Woodhouse
Cc: Mingwei Zhang
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/kvm_main.c |   9 +++
 virt/kvm/pfncache.c | 193 ++++++++++++++++++++++++++++----------------
 2 files changed, 131 insertions(+), 71 deletions(-)
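For context, the consumer side that these changes must stay safe against
looks roughly like this (a sketch of the existing
kvm_gfn_to_pfn_cache_check()/refresh() usage pattern; 'data' and 'len'
are placeholders, not taken from this patch):

	read_lock_irqsave(&gpc->lock, flags);
	while (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, len)) {
		read_unlock_irqrestore(&gpc->lock, flags);

		/* refresh() can sleep, so it must run unlocked. */
		if (kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, len))
			return;

		read_lock_irqsave(&gpc->lock, flags);
	}

	/* gpc->pfn/khva are stable until gpc->lock is dropped. */
	memcpy(gpc->khva, data, len);
	read_unlock_irqrestore(&gpc->lock, flags);

Because a second task can be sitting in check() at any instant, refresh()
must never publish gpc->valid = true while gpc->pfn/khva are stale.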
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dfb7dabdbc63..0848430f36c6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -705,6 +705,15 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	kvm->mn_active_invalidate_count++;
 	spin_unlock(&kvm->mn_invalidate_lock);
 
+	/*
+	 * Invalidate pfn caches _before_ invalidating the secondary MMUs, i.e.
+	 * before acquiring mmu_lock, to avoid holding mmu_lock while acquiring
+	 * each cache's lock.  There are relatively few caches in existence at
+	 * any given time, and the caches themselves can check for hva overlap,
+	 * i.e. don't need to rely on memslot overlap checks for performance.
+	 * Because this runs without holding mmu_lock, the pfn caches must use
+	 * mn_active_invalidate_count (see above) instead of mmu_notifier_count.
+	 */
 	gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end,
 					  hva_range.may_block);
 
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index eaef31462bbe..0535669ea2a1 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -112,31 +112,122 @@ static void gpc_release_pfn_and_khva(struct kvm *kvm, kvm_pfn_t pfn, void *khva)
 	}
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva)
+static int hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc)
 {
+	/* Note, the new page offset may be different than the old! */
+	void *old_khva = gpc->khva - offset_in_page(gpc->khva);
+	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+	void *new_khva = NULL;
 	unsigned long mmu_seq;
-	kvm_pfn_t new_pfn;
-	int retry;
 
-	do {
+	lockdep_assert_held_write(&gpc->lock);
+
+	/*
+	 * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva
+	 * assets have already been updated and so a concurrent check() from a
+	 * different task may not fail the gpa/uhva/generation checks.
+	 */
+	gpc->valid = false;
+
+	for (;;) {
 		mmu_seq = kvm->mmu_notifier_seq;
 		smp_rmb();
 
+		write_unlock_irq(&gpc->lock);
+
+		/*
+		 * If the previous iteration "failed" due to an mmu_notifier
+		 * event, release the pfn and unmap the kernel virtual address
+		 * from the previous attempt.  Unmapping might sleep, so this
+		 * needs to be done after dropping the lock.  Opportunistically
+		 * check for resched while the lock isn't held.
+		 */
+		if (new_pfn != KVM_PFN_ERR_FAULT) {
+			/*
+			 * Keep the mapping if the previous iteration reused
+			 * the existing mapping and didn't create a new one.
+			 */
+			if (new_khva == old_khva)
+				new_khva = NULL;
+
+			gpc_release_pfn_and_khva(kvm, new_pfn, new_khva);
+
+			cond_resched();
+		}
+
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(uhva, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(gpc->uhva, false, NULL, true, NULL);
 		if (is_error_noslot_pfn(new_pfn))
-			break;
+			goto out_error;
+
+		/*
+		 * Obtain a new kernel mapping if KVM itself will access the
+		 * pfn.  Note, kmap() and memremap() can both sleep, so this
+		 * too must be done outside of gpc->lock!
+		 */
+		if (gpc->usage & KVM_HOST_USES_PFN) {
+			if (new_pfn == gpc->pfn) {
+				new_khva = old_khva;
+			} else if (pfn_valid(new_pfn)) {
+				new_khva = kmap(pfn_to_page(new_pfn));
+#ifdef CONFIG_HAS_IOMEM
+			} else {
+				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
+#endif
+			}
+			if (!new_khva) {
+				kvm_release_pfn_clean(new_pfn);
+				goto out_error;
+			}
+		}
+
+		write_lock_irq(&gpc->lock);
 
-		KVM_MMU_READ_LOCK(kvm);
-		retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
-		KVM_MMU_READ_UNLOCK(kvm);
-		if (!retry)
+		/*
+		 * Other tasks must wait for _this_ refresh to complete before
+		 * attempting to refresh.
+		 */
+		WARN_ON_ONCE(gpc->valid);
+
+		/*
+		 * Wait for mn_active_invalidate_count, not mmu_notifier_count,
+		 * to go away, as the invalidation in the mmu_notifier event
+		 * occurs _before_ mmu_notifier_count is elevated.
+		 *
+		 * Note, mn_active_invalidate_count can change at any time as
+		 * it's not protected by gpc->lock.  But, it is guaranteed to
+		 * be elevated before the mmu_notifier acquires gpc->lock, and
+		 * isn't dropped until after mmu_notifier_seq is updated.  So,
+		 * this task may get a false positive of sorts, i.e. see an
+		 * elevated count and wait even though it's technically safe to
+		 * proceed (because the mmu_notifier will invalidate the cache
+		 * _after_ it's refreshed here), but the cache will never be
+		 * refreshed with stale data, i.e. won't get false negatives.
+		 */
+		if (kvm->mn_active_invalidate_count)
+			continue;
+
+		/*
+		 * Ensure mn_active_invalidate_count is read before
+		 * mmu_notifier_seq.  This pairs with the smp_wmb() in
+		 * mmu_notifier_invalidate_range_end() to guarantee either the
+		 * old (non-zero) value of mn_active_invalidate_count or the
+		 * new (incremented) value of mmu_notifier_seq is observed.
+		 */
+		smp_rmb();
+		if (kvm->mmu_notifier_seq == mmu_seq)
 			break;
+	}
+
+	gpc->valid = true;
+	gpc->pfn = new_pfn;
+	gpc->khva = new_khva + (gpc->gpa & ~PAGE_MASK);
+	return 0;
 
-		cond_resched();
-	} while (1);
+out_error:
+	write_lock_irq(&gpc->lock);
 
-	return new_pfn;
+	return -EFAULT;
 }
 
 int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
@@ -147,7 +238,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	kvm_pfn_t old_pfn, new_pfn;
 	unsigned long old_uhva;
 	void *old_khva;
-	bool old_valid;
 	int ret = 0;
 
 	/*
@@ -169,7 +259,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	old_pfn = gpc->pfn;
 	old_khva = gpc->khva - offset_in_page(gpc->khva);
 	old_uhva = gpc->uhva;
-	old_valid = gpc->valid;
 
 	/* If the userspace HVA is invalid, refresh that first */
 	if (gpc->gpa != gpa || gpc->generation != slots->generation ||
@@ -182,7 +271,6 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 		gpc->uhva = gfn_to_hva_memslot(gpc->memslot, gfn);
 
 		if (kvm_is_error_hva(gpc->uhva)) {
-			gpc->pfn = KVM_PFN_ERR_FAULT;
 			ret = -EFAULT;
 			goto out;
 		}
@@ -192,60 +280,8 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	 * If the userspace HVA changed or the PFN was already invalid,
 	 * drop the lock and do the HVA to PFN lookup again.
 	 */
-	if (!old_valid || old_uhva != gpc->uhva) {
-		unsigned long uhva = gpc->uhva;
-		void *new_khva = NULL;
-
-		/* Placeholders for "hva is valid but not yet mapped" */
-		gpc->pfn = KVM_PFN_ERR_FAULT;
-		gpc->khva = NULL;
-		gpc->valid = true;
-
-		write_unlock_irq(&gpc->lock);
-
-		new_pfn = hva_to_pfn_retry(kvm, uhva);
-		if (is_error_noslot_pfn(new_pfn)) {
-			ret = -EFAULT;
-			goto map_done;
-		}
-
-		if (gpc->usage & KVM_HOST_USES_PFN) {
-			if (new_pfn == old_pfn) {
-				/*
-				 * Reuse the existing pfn and khva, but put the
-				 * reference acquired hva_to_pfn_retry(); the
-				 * cache still holds a reference to the pfn
-				 * from the previous refresh.
-				 */
-				gpc_release_pfn_and_khva(kvm, new_pfn, NULL);
-
-				new_khva = old_khva;
-				old_pfn = KVM_PFN_ERR_FAULT;
-				old_khva = NULL;
-			} else if (pfn_valid(new_pfn)) {
-				new_khva = kmap(pfn_to_page(new_pfn));
-#ifdef CONFIG_HAS_IOMEM
-			} else {
-				new_khva = memremap(pfn_to_hpa(new_pfn), PAGE_SIZE, MEMREMAP_WB);
-#endif
-			}
-			if (new_khva)
-				new_khva += page_offset;
-			else
-				ret = -EFAULT;
-		}
-
-	map_done:
-		write_lock_irq(&gpc->lock);
-		if (ret) {
-			gpc->valid = false;
-			gpc->pfn = KVM_PFN_ERR_FAULT;
-			gpc->khva = NULL;
-		} else {
-			/* At this point, gpc->valid may already have been cleared */
-			gpc->pfn = new_pfn;
-			gpc->khva = new_khva;
-		}
+	if (!gpc->valid || old_uhva != gpc->uhva) {
+		ret = hva_to_pfn_retry(kvm, gpc);
 	} else {
 		/* If the HVA→PFN mapping was already valid, don't unmap it. */
 		old_pfn = KVM_PFN_ERR_FAULT;
@@ -253,11 +289,26 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	}
 
 out:
+	/*
+	 * Invalidate the cache and purge the pfn/khva if the refresh failed.
+	 * Some/all of the uhva, gpa, and memslot generation info may still be
+	 * valid, leave it as is.
+	 */
+	if (ret) {
+		gpc->valid = false;
+		gpc->pfn = KVM_PFN_ERR_FAULT;
+		gpc->khva = NULL;
+	}
+
+	/* Snapshot the new pfn before dropping the lock! */
+	new_pfn = gpc->pfn;
+
 	write_unlock_irq(&gpc->lock);
 
 	mutex_unlock(&gpc->refresh_lock);
 
-	gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
+	if (old_pfn != new_pfn)
+		gpc_release_pfn_and_khva(kvm, old_pfn, old_khva);
 
 	return ret;
 }
-- 
2.36.0.464.gb9c8b46e94-goog