All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexander Graf <agraf@suse.de>
To: kvm-ppc@vger.kernel.org
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, mtosatti@redhat.com,
	Paul Mackerras <paulus@samba.org>
Subject: [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
Date: Fri, 30 May 2014 14:42:52 +0200	[thread overview]
Message-ID: <1401453776-55285-38-git-send-email-agraf@suse.de> (raw)
In-Reply-To: <1401453776-55285-1-git-send-email-agraf@suse.de>

From: Paul Mackerras <paulus@samba.org>

Current, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 +++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 96c9044..8056107 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte)
 	kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
 }
 
+static int vcpus_running(struct kvm *kvm)
+{
+	return atomic_read(&kvm->arch.vcpus_running) != 0;
+}
+
 /*
  * Returns the number of system pages that are dirty.
  * This can be more than 1 if we find a huge-page HPTE.
@@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 	struct revmap_entry *rev = kvm->arch.revmap;
 	unsigned long head, i, j;
 	unsigned long n;
+	unsigned long v, r;
 	unsigned long *hptep;
 	int npages_dirty = 0;
 
@@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 		hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4));
 		j = rev[i].forw;
 
-		if (!(hptep[1] & HPTE_R_C))
+		/*
+		 * Checking the C (changed) bit here is racy since there
+		 * is no guarantee about when the hardware writes it back.
+		 * If the HPTE is not writable then it is stable since the
+		 * page can't be written to, and we would have done a tlbie
+		 * (which forces the hardware to complete any writeback)
+		 * when making the HPTE read-only.
+		 * If vcpus are running then this call is racy anyway
+		 * since the page could get dirtied subsequently, so we
+		 * expect there to be a further call which would pick up
+		 * any delayed C bit writeback.
+		 * Otherwise we need to do the tlbie even if C==0 in
+		 * order to pick up any delayed writeback of C.
+		 */
+		if (!(hptep[1] & HPTE_R_C) &&
+		    (!hpte_is_writable(hptep[1]) || vcpus_running(kvm)))
 			continue;
 
 		if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) {
@@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 		}
 
 		/* Now check and modify the HPTE */
-		if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) {
-			/* need to make it temporarily absent to clear C */
-			hptep[0] |= HPTE_V_ABSENT;
-			kvmppc_invalidate_hpte(kvm, hptep, i);
-			hptep[1] &= ~HPTE_R_C;
-			eieio();
-			hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
+		if (!(hptep[0] & HPTE_V_VALID))
+			continue;
+
+		/* need to make it temporarily absent so C is stable */
+		hptep[0] |= HPTE_V_ABSENT;
+		kvmppc_invalidate_hpte(kvm, hptep, i);
+		v = hptep[0];
+		r = hptep[1];
+		if (r & HPTE_R_C) {
+			hptep[1] = r & ~HPTE_R_C;
 			if (!(rev[i].guest_rpte & HPTE_R_C)) {
 				rev[i].guest_rpte |= HPTE_R_C;
 				note_hpte_modification(kvm, &rev[i]);
 			}
-			n = hpte_page_size(hptep[0], hptep[1]);
+			n = hpte_page_size(v, r);
 			n = (n + PAGE_SIZE - 1) >> PAGE_SHIFT;
 			if (n > npages_dirty)
 				npages_dirty = n;
+			eieio();
 		}
-		hptep[0] &= ~HPTE_V_HVLOCK;
+		v &= ~(HPTE_V_ABSENT | HPTE_V_HVLOCK);
+		v |= HPTE_V_VALID;
+		hptep[0] = v;
 	} while ((i = j) != head);
 
 	unlock_rmap(rmapp);
-- 
1.8.1.4


WARNING: multiple messages have this Message-ID (diff)
From: Alexander Graf <agraf@suse.de>
To: kvm-ppc@vger.kernel.org
Cc: kvm@vger.kernel.org, pbonzini@redhat.com, mtosatti@redhat.com,
	Paul Mackerras <paulus@samba.org>
Subject: [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
Date: Fri, 30 May 2014 12:42:52 +0000	[thread overview]
Message-ID: <1401453776-55285-38-git-send-email-agraf@suse.de> (raw)
In-Reply-To: <1401453776-55285-1-git-send-email-agraf@suse.de>

From: Paul Mackerras <paulus@samba.org>

Current, when testing whether a page is dirty (when constructing the
bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
in the HPT entries mapping the page, and if it is 0, we consider the
page to be clean.  However, the Power ISA doesn't require processors
to set the C bit to 1 immediately when writing to a page, and in fact
allows them to delay the writeback of the C bit until they receive a
TLB invalidation for the page.  Thus it is possible that the page
could be dirty and we miss it.

Now, if there are vcpus running, this is not serious since the
collection of the dirty log is racy already - some vcpu could dirty
the page just after we check it.  But if there are no vcpus running we
should return definitive results, in case we are in the final phase of
migrating the guest.

Also, if the permission bits in the HPTE don't allow writing, then we
know that no CPU can set C.  If the HPTE was previously writable and
the page was modified, any C bit writeback would have been flushed out
by the tlbie that we did when changing the HPTE to read-only.

Otherwise we need to do a TLB invalidation even if the C bit is 0, and
then check the C bit.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 47 +++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 96c9044..8056107 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1060,6 +1060,11 @@ void kvm_set_spte_hva_hv(struct kvm *kvm, unsigned long hva, pte_t pte)
 	kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
 }
 
+static int vcpus_running(struct kvm *kvm)
+{
+	return atomic_read(&kvm->arch.vcpus_running) != 0;
+}
+
 /*
  * Returns the number of system pages that are dirty.
  * This can be more than 1 if we find a huge-page HPTE.
@@ -1069,6 +1074,7 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 	struct revmap_entry *rev = kvm->arch.revmap;
 	unsigned long head, i, j;
 	unsigned long n;
+	unsigned long v, r;
 	unsigned long *hptep;
 	int npages_dirty = 0;
 
@@ -1088,7 +1094,22 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 		hptep = (unsigned long *) (kvm->arch.hpt_virt + (i << 4));
 		j = rev[i].forw;
 
-		if (!(hptep[1] & HPTE_R_C))
+		/*
+		 * Checking the C (changed) bit here is racy since there
+		 * is no guarantee about when the hardware writes it back.
+		 * If the HPTE is not writable then it is stable since the
+		 * page can't be written to, and we would have done a tlbie
+		 * (which forces the hardware to complete any writeback)
+		 * when making the HPTE read-only.
+		 * If vcpus are running then this call is racy anyway
+		 * since the page could get dirtied subsequently, so we
+		 * expect there to be a further call which would pick up
+		 * any delayed C bit writeback.
+		 * Otherwise we need to do the tlbie even if C=0 in
+		 * order to pick up any delayed writeback of C.
+		 */
+		if (!(hptep[1] & HPTE_R_C) &&
+		    (!hpte_is_writable(hptep[1]) || vcpus_running(kvm)))
 			continue;
 
 		if (!try_lock_hpte(hptep, HPTE_V_HVLOCK)) {
@@ -1100,23 +1121,29 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 		}
 
 		/* Now check and modify the HPTE */
-		if ((hptep[0] & HPTE_V_VALID) && (hptep[1] & HPTE_R_C)) {
-			/* need to make it temporarily absent to clear C */
-			hptep[0] |= HPTE_V_ABSENT;
-			kvmppc_invalidate_hpte(kvm, hptep, i);
-			hptep[1] &= ~HPTE_R_C;
-			eieio();
-			hptep[0] = (hptep[0] & ~HPTE_V_ABSENT) | HPTE_V_VALID;
+		if (!(hptep[0] & HPTE_V_VALID))
+			continue;
+
+		/* need to make it temporarily absent so C is stable */
+		hptep[0] |= HPTE_V_ABSENT;
+		kvmppc_invalidate_hpte(kvm, hptep, i);
+		v = hptep[0];
+		r = hptep[1];
+		if (r & HPTE_R_C) {
+			hptep[1] = r & ~HPTE_R_C;
 			if (!(rev[i].guest_rpte & HPTE_R_C)) {
 				rev[i].guest_rpte |= HPTE_R_C;
 				note_hpte_modification(kvm, &rev[i]);
 			}
-			n = hpte_page_size(hptep[0], hptep[1]);
+			n = hpte_page_size(v, r);
 			n = (n + PAGE_SIZE - 1) >> PAGE_SHIFT;
 			if (n > npages_dirty)
 				npages_dirty = n;
+			eieio();
 		}
-		hptep[0] &= ~HPTE_V_HVLOCK;
+		v &= ~(HPTE_V_ABSENT | HPTE_V_HVLOCK);
+		v |= HPTE_V_VALID;
+		hptep[0] = v;
 	} while ((i = j) != head);
 
 	unlock_rmap(rmapp);
-- 
1.8.1.4


  parent reply	other threads:[~2014-05-30 12:43 UTC|newest]

Thread overview: 104+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-30 12:42 [PULL 00/41] ppc patch queue 2014-05-30 Alexander Graf
2014-05-30 12:42 ` Alexander Graf
2014-05-30 12:42 ` [PULL 01/41] KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 02/41] KVM: PPC: E500: Add dcbtls emulation Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 03/41] KVM: PPC: BOOK3S: PR: Enable Little Endian PR guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 04/41] KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 05/41] KVM: PPC: Book3S: PR: Fix C/R bit setting Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 06/41] KVM: PPC: Book3S_32: PR: Access HTAB in big endian Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 07/41] KVM: PPC: Book3S_64 " Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 08/41] KVM: PPC: Book3S_64 PR: Access shadow slb " Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 09/41] KVM: PPC: Book3S PR: Default to big endian guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 10/41] KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 11/41] KVM: PPC: Book3S PR: PAPR: Access RTAS " Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 12/41] KVM: PPC: PR: Fill pvinfo hcall instructions " Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 13/41] KVM: PPC: Make shared struct aka magic page guest endian Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 14/41] KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 15/41] KVM: PPC: Book3S: Move little endian conflict to HV KVM Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 16/41] KVM: PPC: Book3S PR: Ignore PMU SPRs Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 17/41] KVM: PPC: Book3S PR: Emulate TIR register Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 18/41] KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 19/41] KVM: PPC: Book3S PR: Expose TAR facility to guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 20/41] KVM: PPC: Book3S PR: Expose EBB registers Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 21/41] KVM: PPC: Book3S PR: Expose TM registers Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 22/41] KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 23/41] KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 24/41] KVM: PPC: Disable NX for old magic page using guests Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 25/41] PPC: KVM: Make NX bit available with magic page Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 26/41] KVM: PPC: BOOK3S: Always use the saved DAR value Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 27/41] KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 28/41] PPC: ePAPR: Fix hypercall on LE guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 29/41] KVM: PPC: Graciously fail broken LE hypercalls Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 30/41] KVM: PPC: MPIC: Reset IRQ source private members Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 31/41] KVM: PPC: Add CAP to indicate hcall fixes Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 32/41] KVM: PPC: Book3S: Add ONE_REG register names that were missed Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 33/41] KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 15:50   ` Paolo Bonzini
2014-05-30 15:50     ` Paolo Bonzini
2014-05-30 15:53     ` Alexander Graf
2014-05-30 15:53       ` Alexander Graf
2014-05-30 15:55       ` Paolo Bonzini
2014-05-30 15:55         ` Paolo Bonzini
2014-05-30 15:58         ` Alexander Graf
2014-05-30 15:58           ` Alexander Graf
2014-05-30 16:03           ` Paolo Bonzini
2014-05-30 16:03             ` Paolo Bonzini
2014-05-30 16:08             ` Alexander Graf
2014-05-30 16:08               ` Alexander Graf
2014-05-30 16:11               ` Paolo Bonzini
2014-05-30 16:11                 ` Paolo Bonzini
2014-05-30 16:14                 ` Alexander Graf
2014-05-30 16:14                   ` Alexander Graf
2014-05-30 12:42 ` [PULL 34/41] KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 35/41] KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 36/41] KVM: PPC: Book3S HV: Fix dirty map for hugepages Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` Alexander Graf [this message]
2014-05-30 12:42   ` [PULL 37/41] KVM: PPC: Book3S HV: Make sure we don't miss dirty pages Alexander Graf
2014-05-30 12:42 ` [PULL 38/41] KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 39/41] KVM: PPC: Book3S HV: Fix machine check delivery to guest Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 40/41] KVM: PPC: Book3S PR: Use SLB entry 0 Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:42 ` [PULL 41/41] KVM: PPC: Book3S PR: Rework SLB switching code Alexander Graf
2014-05-30 12:42   ` Alexander Graf
2014-05-30 12:58 ` [PULL 00/41] ppc patch queue 2014-05-30 Paolo Bonzini
2014-05-30 12:58   ` Paolo Bonzini
2014-05-30 13:10   ` Alexander Graf
2014-05-30 13:10     ` Alexander Graf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1401453776-55285-38-git-send-email-agraf@suse.de \
    --to=agraf@suse.de \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    --cc=paulus@samba.org \
    --cc=pbonzini@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.