From: Alexander Graf <agraf@suse.de> To: kvm-ppc@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, mtosatti@redhat.com, Paul Mackerras <paulus@samba.org> Subject: [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages Date: Fri, 30 May 2014 14:42:52 +0200 [thread overview] Message-ID: <1401453776-55285-38-git-send-email-agraf@suse.de> (raw) In-Reply-To: <1401453776-55285-1-git-send-email-agraf@suse.de> From: Paul Mackerras <paulus@samba.org> Current, when testing whether a page is dirty (when constructing the bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit in the HPT entries mapping the page, and if it is 0, we consider the page to be clean. However, the Power ISA doesn't require processors to set the C bit to 1 immediately when writing to a page, and in fact allows them to delay the writeback of the C bit until they receive a TLB invalidation for the page. Thus it is possible that the page could be dirty and we miss it. Now, if there are vcpus running, this is not serious since the collection of the dirty log is racy already - some vcpu could dirty the page just after we check it. But if there are no vcpus running we should return definitive results, in case we are in the final phase of migrating the guest. Also, if the permission bits in the HPTE don't allow writing, then we know that no CPU can set C. If the HPTE was previously writable and the page was modified, any C bit writeback would have been flushed out by the tlbie that we did when changing the HPTE to read-only. Otherwise we need to do a TLB invalidation even if the C bit is 0, and then check the C bit. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 +++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 96c9044..8056107 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte) kvm_handle_hva(kvm, hva, kvm_unmap_rmapp); } +static int vcpus_running(struct kvm *kvm) +{ + return atomic_read(&kvm->arch.vcpus_running) != 0; +} + /* * Returns the number of system pages that are dirty. * This can be more than 1 if we find a huge-page HPTE. @@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) struct revmap_entry *rev = kvm->arch.revmap; unsigned long head, i, j; unsigned long n; + unsigned long v, r; unsigned long *hptep; int npages_dirty = 0; @@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4)); j = rev[i].forw; - if (!(hptep[1] & HPTE_R_C)) + /* + * Checking the C (changed) bit here is racy since there + * is no guarantee about when the hardware writes it back. + * If the HPTE is not writable then it is stable since the + * page can't be written to, and we would have done a tlbie + * (which forces the hardware to complete any writeback) + * when making the HPTE read-only. + * If vcpus are running then this call is racy anyway + * since the page could get dirtied subsequently, so we + * expect there to be a further call which would pick up + * any delayed C bit writeback. + * Otherwise we need to do the tlbie even if C==0 in + * order to pick up any delayed writeback of C. + */ + if (!(hptep[1] & HPTE_R_C) && + (!hpte_is_writable(hptep[1]) || vcpus_running(kvm))) continue; if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) { @@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) } /* Now check and modify the HPTE */ - if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) { - /* need to make it temporarily absent to clear C */ - hptep[0] |= HPTE_V_ABSENT; - kvmppc_invalidate_hpte(kvm, hptep, i); - hptep[1] &= ~HPTE_R_C; - eieio(); - hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID; + if (!(hptep[0] & HPTE_V_VALID)) + continue; + + /* need to make it temporarily absent so C is stable */ + hptep[0] |= HPTE_V_ABSENT; + kvmppc_invalidate_hpte(kvm, hptep, i); + v = hptep[0]; + r = hptep[1]; + if (r & HPTE_R_C) { + hptep[1] = r & ~HPTE_R_C; if (!(rev[i].guest_rpte & HPTE_R_C)) { rev[i].guest_rpte |= HPTE_R_C; note_hpte_modification(kvm, &rev[i]); } - n = hpte_page_size(hptep[0], hptep[1]); + n = hpte_page_size(v, r); n = (n + PAGE_SIZE - 1) >> PAGE_SHIFT; if (n > npages_dirty) npages_dirty = n; + eieio(); } - hptep[0] &= ~HPTE_V_HVLOCK; + v &= ~(HPTE_V_ABSENT | HPTE_V_HVLOCK); + v |= HPTE_V_VALID; + hptep[0] = v; } while ((i = j) != head); unlock_rmap(rmapp); -- 1.8.1.4
WARNING: multiple messages have this Message-ID (diff)
From: Alexander Graf <agraf@suse.de> To: kvm-ppc@vger.kernel.org Cc: kvm@vger.kernel.org, pbonzini@redhat.com, mtosatti@redhat.com, Paul Mackerras <paulus@samba.org> Subject: [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages Date: Fri, 30 May 2014 12:42:52 +0000 [thread overview] Message-ID: <1401453776-55285-38-git-send-email-agraf@suse.de> (raw) In-Reply-To: <1401453776-55285-1-git-send-email-agraf@suse.de> From: Paul Mackerras <paulus@samba.org> Current, when testing whether a page is dirty (when constructing the bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit in the HPT entries mapping the page, and if it is 0, we consider the page to be clean. However, the Power ISA doesn't require processors to set the C bit to 1 immediately when writing to a page, and in fact allows them to delay the writeback of the C bit until they receive a TLB invalidation for the page. Thus it is possible that the page could be dirty and we miss it. Now, if there are vcpus running, this is not serious since the collection of the dirty log is racy already - some vcpu could dirty the page just after we check it. But if there are no vcpus running we should return definitive results, in case we are in the final phase of migrating the guest. Also, if the permission bits in the HPTE don't allow writing, then we know that no CPU can set C. If the HPTE was previously writable and the page was modified, any C bit writeback would have been flushed out by the tlbie that we did when changing the HPTE to read-only. Otherwise we need to do a TLB invalidation even if the C bit is 0, and then check the C bit. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de> --- arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 +++++++++++++++++++++++++++++-------- 1 file changed, 37 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 96c9044..8056107 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte) kvm_handle_hva(kvm, hva, kvm_unmap_rmapp); } +static int vcpus_running(struct kvm *kvm) +{ + return atomic_read(&kvm->arch.vcpus_running) != 0; +} + /* * Returns the number of system pages that are dirty. * This can be more than 1 if we find a huge-page HPTE. @@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) struct revmap_entry *rev = kvm->arch.revmap; unsigned long head, i, j; unsigned long n; + unsigned long v, r; unsigned long *hptep; int npages_dirty = 0; @@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4)); j = rev[i].forw; - if (!(hptep[1] & HPTE_R_C)) + /* + * Checking the C (changed) bit here is racy since there + * is no guarantee about when the hardware writes it back. + * If the HPTE is not writable then it is stable since the + * page can't be written to, and we would have done a tlbie + * (which forces the hardware to complete any writeback) + * when making the HPTE read-only. + * If vcpus are running then this call is racy anyway + * since the page could get dirtied subsequently, so we + * expect there to be a further call which would pick up + * any delayed C bit writeback. + * Otherwise we need to do the tlbie even if C=0 in + * order to pick up any delayed writeback of C. + */ + if (!(hptep[1] & HPTE_R_C) && + (!hpte_is_writable(hptep[1]) || vcpus_running(kvm))) continue; if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) { @@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp) } /* Now check and modify the HPTE */ - if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) { - /* need to make it temporarily absent to clear C */ - hptep[0] |= HPTE_V_ABSENT; - kvmppc_invalidate_hpte(kvm, hptep, i); - hptep[1] &= ~HPTE_R_C; - eieio(); - hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID; + if (!(hptep[0] & HPTE_V_VALID)) + continue; + + /* need to make it temporarily absent so C is stable */ + hptep[0] |= HPTE_V_ABSENT; + kvmppc_invalidate_hpte(kvm, hptep, i); + v = hptep[0]; + r = hptep[1]; + if (r & HPTE_R_C) { + hptep[1] = r & ~HPTE_R_C; if (!(rev[i].guest_rpte & HPTE_R_C)) { rev[i].guest_rpte |= HPTE_R_C; note_hpte_modification(kvm, &rev[i]); } - n = hpte_page_size(hptep[0], hptep[1]); + n = hpte_page_size(v, r); n = (n + PAGE_SIZE - 1) >> PAGE_SHIFT; if (n > npages_dirty) npages_dirty = n; + eieio(); } - hptep[0] &= ~HPTE_V_HVLOCK; + v &= ~(HPTE_V_ABSENT | HPTE_V_HVLOCK); + v |= HPTE_V_VALID; + hptep[0] = v; } while ((i = j) != head); unlock_rmap(rmapp); -- 1.8.1.4
next prev parent reply other threads:[~2014-05-30 12:43 UTC|newest] Thread overview: 104+ messages / expand[flat|nested] mbox.gz Atom feed top 2014-05-30 12:42 [PULL 00/41] ppc patch queue 2014-05-30 Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 01/41] KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 02/41] KVM: PPC: E500: Add dcbtls emulation Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 03/41] KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 04/41] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 05/41] KVM: PPC: Book3S: PR: Fix C/R bit setting Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 06/41] KVM: PPC: Book3S_32: PR: Access HTAB in big endian Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 07/41] KVM: PPC: Book3S_64 " Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 08/41] KVM: PPC: Book3S_64 PR: Access shadow slb " Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 09/41] KVM: PPC: Book3S PR: Default to big endian guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 10/41] KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 11/41] KVM: PPC: Book3S PR: PAPR: Access RTAS " Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 12/41] KVM: PPC: PR: Fill pvinfo hcall instructions " Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 13/41] KVM: PPC: Make shared struct aka magic page guest endian Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 14/41] KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 15/41] KVM: PPC: Book3S: Move little endian conflict to HV KVM Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 16/41] KVM: PPC: Book3S PR: Ignore PMU SPRs Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 17/41] KVM: PPC: Book3S PR: Emulate TIR register Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 18/41] KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 19/41] KVM: PPC: Book3S PR: Expose TAR facility to guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 20/41] KVM: PPC: Book3S PR: Expose EBB registers Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 21/41] KVM: PPC: Book3S PR: Expose TM registers Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 22/41] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 23/41] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 24/41] KVM: PPC: Disable NX for old magic page using guests Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 25/41] PPC: KVM: Make NX bit available with magic page Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 26/41] KVM: PPC: BOOK3S: Always use the saved DAR value Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 27/41] KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 28/41] PPC: ePAPR: Fix hypercall on LE guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 29/41] KVM: PPC: Graciously fail broken LE hypercalls Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 30/41] KVM: PPC: MPIC: Reset IRQ source private members Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 31/41] KVM: PPC: Add CAP to indicate hcall fixes Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 32/41] KVM: PPC: Book3S: Add ONE_REG register names that were missed Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 15:50 ` Paolo Bonzini 2014-05-30 15:50 ` Paolo Bonzini 2014-05-30 15:53 ` Alexander Graf 2014-05-30 15:53 ` Alexander Graf 2014-05-30 15:55 ` Paolo Bonzini 2014-05-30 15:55 ` Paolo Bonzini 2014-05-30 15:58 ` Alexander Graf 2014-05-30 15:58 ` Alexander Graf 2014-05-30 16:03 ` Paolo Bonzini 2014-05-30 16:03 ` Paolo Bonzini 2014-05-30 16:08 ` Alexander Graf 2014-05-30 16:08 ` Alexander Graf 2014-05-30 16:11 ` Paolo Bonzini 2014-05-30 16:11 ` Paolo Bonzini 2014-05-30 16:14 ` Alexander Graf 2014-05-30 16:14 ` Alexander Graf 2014-05-30 12:42 ` [PULL 34/41] KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 35/41] KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 36/41] KVM: PPC: Book3S HV: Fix dirty map for hugepages Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` Alexander Graf [this message] 2014-05-30 12:42 ` [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages Alexander Graf 2014-05-30 12:42 ` [PULL 38/41] KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 39/41] KVM: PPC: Book3S HV: Fix machine check delivery to guest Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 40/41] KVM: PPC: Book3S PR: Use SLB entry 0 Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:42 ` [PULL 41/41] KVM: PPC: Book3S PR: Rework SLB switching code Alexander Graf 2014-05-30 12:42 ` Alexander Graf 2014-05-30 12:58 ` [PULL 00/41] ppc patch queue 2014-05-30 Paolo Bonzini 2014-05-30 12:58 ` Paolo Bonzini 2014-05-30 13:10 ` Alexander Graf 2014-05-30 13:10 ` Alexander Graf
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=1401453776-55285-38-git-send-email-agraf@suse.de \ --to=agraf@suse.de \ --cc=kvm-ppc@vger.kernel.org \ --cc=kvm@vger.kernel.org \ --cc=mtosatti@redhat.com \ --cc=paulus@samba.org \ --cc=pbonzini@redhat.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.