[PATCH v2] kvm/selftests: Fix race condition with dirty_log_test

From: Peter Xu <peterx@redhat.com>
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: peterx@redhat.com, Andrew Jones <drjones@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Sean Christopherson <seanjc@google.com>
Subject: [PATCH v2] kvm/selftests: Fix race condition with dirty_log_test
Date: Tue, 13 Apr 2021 17:36:41 -0400	[thread overview]
Message-ID: <20210413213641.23742-1-peterx@redhat.com> (raw)

This fixes a bug that can trigger with e.g. "taskset -c 0 ./dirty_log_test" or
when the testing host is very busy.

The issue is when the vcpu thread got the dirty bit set but got preempted by
other threads _before_ the data is written, we won't be able to see the latest
data only until the vcpu threads do VMENTER. IOW, the guest write operation and
dirty bit set cannot guarantee atomicity. The race could look like:

    main thread                            vcpu thread
    ===========                            ===========
    iteration=X
                                           *addr = X
                                           (so latest data is X)
    iteration=X+1
    ...
    iteration=X+N
                                           guest executes "*addr = X+N"
                                             reg=READ_ONCE(iteration)=X+N
                                             host page fault
                                               set dirty bit for page "addr"
                                             (_before_ VMENTER happens...
                                              so *addr is still X!)
                                           vcpu thread got preempted
    get dirty log
    verify data
      detected dirty bit set, data is X
      not X+N nor X+N-1, data too old!

This patch closes this race by allowing the main thread to give the vcpu thread
chance to do a VMENTER to complete that write operation.  It's done by adding a
vcpu loop counter (must be defined as volatile as main thread will do read
loop), then the main thread can guarantee the vcpu got at least another VMENTER
by making sure the guest_vcpu_loops increases by 2.

Dirty ring does not need this since dirty_ring_last_page would already help
avoid this specific race condition.

Cc: Andrew Jones <drjones@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- drop one unnecessary check on "!matched"
---
 tools/testing/selftests/kvm/dirty_log_test.c | 53 +++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c b/tools/testing/selftests/kvm/dirty_log_test.c
index bb2752d78fe3..b639f088bb86 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -160,6 +160,10 @@ static bool dirty_ring_vcpu_ring_full;
  * verifying process, we let it pass.
  */
 static uint64_t dirty_ring_last_page;
+/*
+ * Record how many loops the guest vcpu thread has went through
+ */
+static volatile uint64_t guest_vcpu_loops;
 
 enum log_mode_t {
 	/* Only use KVM_GET_DIRTY_LOG for logging */
@@ -526,6 +530,7 @@ static void *vcpu_worker(void *data)
 			assert(sig == SIG_IPI);
 		}
 		log_mode_after_vcpu_run(vm, ret, errno);
+		guest_vcpu_loops++;
 	}
 
 	pr_info("Dirtied %"PRIu64" pages\n", pages_count);
@@ -553,6 +558,7 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 		}
 
 		if (test_and_clear_bit_le(page, bmap)) {
+			uint64_t current_loop;
 			bool matched;
 
 			host_dirty_count++;
@@ -565,7 +571,12 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 			matched = (*value_ptr == iteration ||
 				   *value_ptr == iteration - 1);
 
-			if (host_log_mode == LOG_MODE_DIRTY_RING && !matched) {
+			if (matched)
+				continue;
+
+			/* Didn't match..  Let's figure out what's wrong.. */
+			switch (host_log_mode) {
+			case LOG_MODE_DIRTY_RING:
 				if (*value_ptr == iteration - 2 && min_iter <= iteration - 2) {
 					/*
 					 * Short answer: this case is special
@@ -608,6 +619,46 @@ static void vm_dirty_log_verify(enum vm_guest_mode mode, unsigned long *bmap)
 					 */
 					continue;
 				}
+				break;
+			case LOG_MODE_DIRTY_LOG:
+			case LOG_MODE_CLEAR_LOG:
+				/*
+				 * This fixes a bug that can trigger with
+				 * e.g. "taskset -c 0 ./dirty_log_test" or when
+				 * the testing host is very busy, when the vcpu
+				 * thread got the dirty bit set but got
+				 * preempted by other threads _before_ the data
+				 * is written, so we won't be able to see the
+				 * latest data only until the vcpu threads do
+				 * VMENTER and the write finally lands to the
+				 * memory.  So when !matched happened, we give
+				 * the vcpu thread _one_ more chance to do a
+				 * VMENTER so as to flush the written data.  We
+				 * do that by observing guest_vcpu_loops to
+				 * increase +2: as +1 is not enough to
+				 * guarantee a complete VMENTER.
+				 *
+				 * Dirty ring does not need this since
+				 * dirty_ring_last_page would already help
+				 * avoid it.
+				 */
+				current_loop = guest_vcpu_loops;
+
+				/*
+				 * Wait until the vcpu thread at least
+				 * completes one VMENTER again.  the usleep()
+				 * gives the vcpu thread a better chance to be
+				 * scheduled earlier.
+				 */
+				while (guest_vcpu_loops <= current_loop + 2)
+					usleep(1);
+				/* Recalculate */
+				matched = (*value_ptr == iteration ||
+					   *value_ptr == iteration - 1);
+				break;
+			default:
+				/* Just to avoid some strict compile warning */
+				break;
 			}
 
 			TEST_ASSERT(matched,
-- 
2.26.2