* [PATCH 1/4] MAINTAINERS: Add KVM x86/xen maintainer list
@ 2022-11-19  9:46 David Woodhouse
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-19  9:46 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

From: David Woodhouse <dwmw@amazon.co.uk>

Adding Paul as co-maintainer of Xen support to help ensure that things
don't fall through the cracks when I spend three months at a time
travelling...

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 MAINTAINERS | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 046ff06ff97f..89672a59c0c3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11324,6 +11324,16 @@ F:	arch/x86/kvm/svm/hyperv.*
 F:	arch/x86/kvm/svm/svm_onhyperv.*
 F:	arch/x86/kvm/vmx/evmcs.*
 
+KVM X86 Xen (KVM/Xen)
+M:	David Woodhouse <dwmw2@infradead.org>
+M:	Paul Durrant <paul@xen.org>
+M:	Sean Christopherson <seanjc@google.com>
+M:	Paolo Bonzini <pbonzini@redhat.com>
+L:	kvm@vger.kernel.org
+S:	Supported
+T:	git git://git.kernel.org/pub/scm/virt/kvm/kvm.git
+F:	arch/x86/kvm/xen.*
+
 KERNFS
 M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 M:	Tejun Heo <tj@kernel.org>
-- 
2.35.3



* [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19  9:46 [PATCH 1/4] MAINTAINERS: Add KVM x86/xen maintainer list David Woodhouse
@ 2022-11-19  9:46 ` David Woodhouse
  2022-11-19 11:38   ` Durrant, Paul
                     ` (3 more replies)
  2022-11-19  9:46 ` [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page David Woodhouse
  2022-11-19  9:46 ` [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary David Woodhouse
  2 siblings, 4 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-19  9:46 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

From: David Woodhouse <dwmw@amazon.co.uk>

The guest runstate area can be arbitrarily byte-aligned. In fact, even
when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
fields in the structure end up being unaligned due to the fact that the
32-bit ABI only aligns them to 32 bits.

So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
is buggy, because if it's unaligned then we can't update the whole field
atomically; the low bytes might be observable before the _UPDATE bit is.
Xen actually updates the *byte* containing that top bit, on its own. KVM
should do the same.
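
As a sketch of the byte-granular update described above (mirroring the code
added below, where user_times points at state_entry_time in the mapped guest
area; on little-endian x86 the flag sits in the final byte of that field):

    /* Set the XEN_RUNSTATE_UPDATE flag by touching only its byte */
    u8 *update_bit = ((u8 *)&user_times[1]) - 1;

    *update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
    smp_wmb();                      /* flag visible before the data */

    /* ... update state, state_entry_time and the time[] array ... */

    smp_wmb();                      /* data visible before clearing */
    *update_bit = vx->runstate_entry_time >> 56;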

In addition, we cannot assume that the runstate area fits within a single
page. One option might be to make the gfn_to_pfn cache cope with regions
that cross a page — but getting a contiguous virtual kernel mapping of a
discontiguous set of IOMEM pages is a distinctly non-trivial exercise,
and it seems this is the *only* current use case for the GPC which would
benefit from it.

An earlier version of the runstate code did use a gfn_to_hva cache for
this purpose, but it still had the single-page restriction because it
used the uhva directly — because it needs to be able to do so atomically
when the vCPU is being scheduled out, so it used pagefault_disable()
around the accesses and didn't just use kvm_write_guest_cached() which
has a fallback path.

So... use a pair of GPCs for the first and potential second page covering
the runstate area. We can get away with locking both at once because
nothing else takes more than one GPC lock at a time so we can invent
a trivial ordering rule.
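
The resulting locking pattern, roughly (gpc1/gpc2 as in the code below; the
check/refresh retry loops are omitted):

    read_lock_irqsave(&gpc1->lock, flags);  /* always take GPC1 first */
    if (user_len2)
        read_lock(&gpc2->lock);             /* then GPC2, never the reverse */

    /* ... write the runstate area through gpc1->khva / gpc2->khva ... */

    if (user_len2)
        read_unlock(&gpc2->lock);
    read_unlock_irqrestore(&gpc1->lock, flags);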

Keep the trivial fast path for the common case where it's all in the
same page, but fixed to use a byte access for the XEN_RUNSTATE_UPDATE
bit. And in the cross-page case, build the structure locally on the
stack and then copy it over with two memcpy() calls, again handling
the XEN_RUNSTATE_UPDATE bit through a single byte access.

Finally, Xen also does write the runstate area immediately when it's
configured. Flip the kvm_xen_update_runstate() and …_guest() functions
and call the latter directly when the runstate area is set. This means
that other ioctls which modify the runstate also write it immediately
to the guest when they do so, which is also intended.

Update the xen_shinfo_test to exercise the pathological case where the
XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
actually in a different page to the rest of the 64-bit word.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h               |   1 +
 arch/x86/kvm/xen.c                            | 367 +++++++++++++-----
 arch/x86/kvm/xen.h                            |   6 +-
 .../selftests/kvm/x86_64/xen_shinfo_test.c    |  12 +-
 4 files changed, 272 insertions(+), 114 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d1013c4f673c..70af7240a1d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -686,6 +686,7 @@ struct kvm_vcpu_xen {
 	struct gfn_to_pfn_cache vcpu_info_cache;
 	struct gfn_to_pfn_cache vcpu_time_info_cache;
 	struct gfn_to_pfn_cache runstate_cache;
+	struct gfn_to_pfn_cache runstate2_cache;
 	u64 last_steal;
 	u64 runstate_entry_time;
 	u64 runstate_times[4];
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 4b8e9628fbf5..8aa953b1f0e0 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -170,148 +170,269 @@ static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
-static void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
+static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
-	u64 now = get_kvmclock_ns(v->kvm);
-	u64 delta_ns = now - vx->runstate_entry_time;
-	u64 run_delay = current->sched_info.run_delay;
+	struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
+	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
+	size_t user_len, user_len1, user_len2;
+	struct vcpu_runstate_info rs;
+	int *rs_state = &rs.state;
+	unsigned long flags;
+	size_t times_ofs;
+	u8 *update_bit;
 
-	if (unlikely(!vx->runstate_entry_time))
-		vx->current_runstate = RUNSTATE_offline;
+	/*
+	 * The only difference between 32-bit and 64-bit versions of the
+	 * runstate struct us the alignment of uint64_t in 32-bit, which
+	 * means that the 64-bit version has an additional 4 bytes of
+	 * padding after the first field 'state'.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
+	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
+	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
+#ifdef CONFIG_X86_64
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
+		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
+#endif
+
+	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
+		user_len = sizeof(struct vcpu_runstate_info);
+		times_ofs = offsetof(struct vcpu_runstate_info,
+				     state_entry_time);
+	} else {
+		user_len = sizeof(struct compat_vcpu_runstate_info);
+		times_ofs = offsetof(struct compat_vcpu_runstate_info,
+				     state_entry_time);
+		rs_state++;
+	}
 
 	/*
-	 * Time waiting for the scheduler isn't "stolen" if the
-	 * vCPU wasn't running anyway.
+	 * There are basically no alignment constraints. The guest can set it
+	 * up so it crosses from one page to the next, and at arbitrary byte
+	 * alignment (and the 32-bit ABI doesn't align the 64-bit integers
+	 * anyway, even if the overall struct had been 64-bit aligned).
 	 */
-	if (vx->current_runstate == RUNSTATE_running) {
-		u64 steal_ns = run_delay - vx->last_steal;
+	if ((gpc1->gpa & ~PAGE_MASK) + user_len >= PAGE_SIZE) {
+		user_len1 = PAGE_SIZE - (gpc1->gpa & ~PAGE_MASK);
+		user_len2 = user_len - user_len1;
+	} else {
+		user_len1 = user_len;
+		user_len2 = 0;
+	}
+	BUG_ON(user_len1 + user_len2 != user_len);
 
-		delta_ns -= steal_ns;
+ retry:
+	/*
+	 * Attempt to obtain the GPC lock on *both* (if there are two)
+	 * gfn_to_pfn caches that cover the region.
+	 */
+	read_lock_irqsave(&gpc1->lock, flags);
+	while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa, user_len1)) {
+		read_unlock_irqrestore(&gpc1->lock, flags);
 
-		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
+		/* When invoked from kvm_sched_out() we cannot sleep */
+		if (atomic)
+			return;
+
+		if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa, user_len1))
+			return;
+
+		read_lock_irqsave(&gpc1->lock, flags);
 	}
-	vx->last_steal = run_delay;
 
-	vx->runstate_times[vx->current_runstate] += delta_ns;
-	vx->current_runstate = state;
-	vx->runstate_entry_time = now;
-}
+	/*
+	 * The common case is that it all fits on a page and we can
+	 * just do it the simple way.
+	 */
+	if (likely(!user_len2)) {
+		/*
+		 * We use 'int *user_state' to point to the state field, and
+		 * 'u64 *user_times' for runstate_entry_time. So the actual
+		 * array of time[] in each state starts at user_times[1].
+		 */
+		int *user_state = gpc1->khva;
+		u64 *user_times = gpc1->khva + times_ofs;
 
-void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, int state)
-{
-	struct kvm_vcpu_xen *vx = &v->arch.xen;
-	struct gfn_to_pfn_cache *gpc = &vx->runstate_cache;
-	uint64_t *user_times;
-	unsigned long flags;
-	size_t user_len;
-	int *user_state;
+		/*
+		 * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
+		 * field. We need to set it (and write-barrier) before writing to
+		 * the rest of the structure, and clear it last. Just as Xen does, we
+		 * address the single *byte* in which it resides because it might be
+		 * in a different cache line to the rest of the 64-bit word, due to
+		 * the (lack of) alignment constraints.
+		 */
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+			     sizeof(uint64_t));
+		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+			     sizeof(uint64_t));
+		BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
 
-	kvm_xen_update_runstate(v, state);
+		update_bit = ((u8 *)(&user_times[1])) - 1;
+		*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
+		smp_wmb();
 
-	if (!vx->runstate_cache.active)
-		return;
+		/*
+		 * Next, write the new runstate. This is in the *same* place
+		 * for 32-bit and 64-bit guests, asserted here for paranoia.
+		 */
+		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+			     offsetof(struct compat_vcpu_runstate_info, state));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+			     sizeof(vx->current_runstate));
+		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+			     sizeof(vx->current_runstate));
+		*user_state = vx->current_runstate;
 
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode)
-		user_len = sizeof(struct vcpu_runstate_info);
-	else
-		user_len = sizeof(struct compat_vcpu_runstate_info);
+		/*
+		 * Then the actual runstate_entry_time (with the UPDATE bit
+		 * still set).
+		 */
+		*user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
 
-	read_lock_irqsave(&gpc->lock, flags);
-	while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc, gpc->gpa,
-					   user_len)) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		/*
+		 * Write the actual runstate times immediately after the
+		 * runstate_entry_time.
+		 */
+		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+			     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+		BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+			     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+			     sizeof_field(struct compat_vcpu_runstate_info, time));
+		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+			     sizeof(vx->runstate_times));
+		memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
+
+		smp_wmb();
+
+		/*
+		 * Finally, clear the 'updating' bit. Don't use &= here because
+		 * the compiler may not realise that update_bit and user_times
+		 * point to the same place. That's a classic pointer-aliasing
+		 * problem.
+		 */
+		*update_bit = vx->runstate_entry_time >> 56;
+		smp_wmb();
+
+		goto done_1;
+	}
+
+	/*
+	 * The painful code path. It's split across two pages and we need to
+	 * hold and validate both GPCs simultaneously. Thankfully we can get
+	 * away with declaring a lock ordering GPC1 > GPC2 because nothing
+	 * else takes them more than one at a time.
+	 */
+	read_lock(&gpc2->lock);
+
+	if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+		read_unlock(&gpc2->lock);
+		read_unlock_irqrestore(&gpc1->lock, flags);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
-		if (state == RUNSTATE_runnable)
+		if (atomic)
 			return;
 
-		if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc, gpc->gpa, user_len))
+		/*
+		 * Use kvm_gpc_activate() here because if the runstate
+		 * area was configured in 32-bit mode and only extends
+		 * to the second page now because the guest changed to
+		 * 64-bit mode, the second GPC won't have been set up.
+		 */
+		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
+				     gpc1->gpa + user_len1, user_len2))
 			return;
 
-		read_lock_irqsave(&gpc->lock, flags);
+		/*
+		 * We dropped the lock on GPC1 so we have to go all the way
+		 * back and revalidate that too.
+		 */
+		goto retry;
 	}
 
 	/*
-	 * The only difference between 32-bit and 64-bit versions of the
-	 * runstate struct us the alignment of uint64_t in 32-bit, which
-	 * means that the 64-bit version has an additional 4 bytes of
-	 * padding after the first field 'state'.
-	 *
-	 * So we use 'int __user *user_state' to point to the state field,
-	 * and 'uint64_t __user *user_times' for runstate_entry_time. So
-	 * the actual array of time[] in each state starts at user_times[1].
+	 * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
-	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
-	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
-#ifdef CONFIG_X86_64
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
-		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
-#endif
-
-	user_state = gpc->khva;
-
-	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode)
-		user_times = gpc->khva + offsetof(struct vcpu_runstate_info,
-						  state_entry_time);
+	if (user_len1 >= times_ofs + sizeof(uint64_t))
+		update_bit = ((u8 *)gpc1->khva) + times_ofs + sizeof(u64) - 1;
 	else
-		user_times = gpc->khva + offsetof(struct compat_vcpu_runstate_info,
-						  state_entry_time);
+		update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
+			user_len1;
 
-	/*
-	 * First write the updated state_entry_time at the appropriate
-	 * location determined by 'offset'.
+
+	/* Create a structure on our stack with everything in the right place.
+	 * The rs_state pointer points to the start of it, which in the case
+	 * of a compat guest on a 64-bit host is the 32 bit field that the
+	 * compiler thinks is padding.
 	 */
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-		     sizeof(user_times[0]));
-	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-		     sizeof(user_times[0]));
+	*rs_state = vx->current_runstate;
+#ifdef CONFIG_X86_64
+	/* Don't leak kernel memory through the padding in the 64-bit struct */
+	if (rs_state == &rs.state)
+		rs_state[1] = 0;
+#endif
+	rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
 
-	user_times[0] = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	*update_bit = rs.state_entry_time >> 56;
 	smp_wmb();
 
 	/*
-	 * Next, write the new runstate. This is in the *same* place
-	 * for 32-bit and 64-bit guests, asserted here for paranoia.
+	 * Having constructed the structure, copy it into the first and
+	 * second pages as appropriate using user_len1 and user_len2.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-		     offsetof(struct compat_vcpu_runstate_info, state));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-		     sizeof(vx->current_runstate));
-	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-		     sizeof(vx->current_runstate));
-
-	*user_state = vx->current_runstate;
+	memcpy(gpc1->khva, rs_state, user_len1);
+	memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	smp_wmb();
 
 	/*
-	 * Write the actual runstate times immediately after the
-	 * runstate_entry_time.
+	 * Finally, clear the XEN_RUNSTATE_UPDATE bit.
 	 */
-	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-		     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-		     sizeof_field(struct compat_vcpu_runstate_info, time));
-	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-		     sizeof(vx->runstate_times));
-
-	memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
+	*update_bit = vx->runstate_entry_time >> 56;
 	smp_wmb();
 
+	if (user_len2)
+		read_unlock(&gpc2->lock);
+ done_1:
+	read_unlock_irqrestore(&gpc1->lock, flags);
+
+	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
+	if (user_len2)
+		mark_page_dirty_in_slot(v->kvm, gpc2->memslot, gpc2->gpa >> PAGE_SHIFT);
+}
+
+void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
+{
+	struct kvm_vcpu_xen *vx = &v->arch.xen;
+	u64 now = get_kvmclock_ns(v->kvm);
+	u64 delta_ns = now - vx->runstate_entry_time;
+	u64 run_delay = current->sched_info.run_delay;
+
+	if (unlikely(!vx->runstate_entry_time))
+		vx->current_runstate = RUNSTATE_offline;
+
 	/*
-	 * Finally, clear the XEN_RUNSTATE_UPDATE bit in the guest's
-	 * runstate_entry_time field.
+	 * Time waiting for the scheduler isn't "stolen" if the
+	 * vCPU wasn't running anyway.
 	 */
-	user_times[0] &= ~XEN_RUNSTATE_UPDATE;
-	smp_wmb();
+	if (vx->current_runstate == RUNSTATE_running) {
+		u64 steal_ns = run_delay - vx->last_steal;
 
-	read_unlock_irqrestore(&gpc->lock, flags);
+		delta_ns -= steal_ns;
 
-	mark_page_dirty_in_slot(v->kvm, gpc->memslot, gpc->gpa >> PAGE_SHIFT);
+		vx->runstate_times[RUNSTATE_runnable] += steal_ns;
+	}
+	vx->last_steal = run_delay;
+
+	vx->runstate_times[vx->current_runstate] += delta_ns;
+	vx->current_runstate = state;
+	vx->runstate_entry_time = now;
+
+	if (vx->runstate_cache.active)
+		kvm_xen_update_runstate_guest(v, state == RUNSTATE_runnable);
 }
 
 static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
@@ -584,23 +705,57 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		break;
 
-	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR:
+	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR: {
+		size_t sz, sz1, sz2;
+
 		if (!sched_info_on()) {
 			r = -EOPNOTSUPP;
 			break;
 		}
 		if (data->u.gpa == GPA_INVALID) {
+			r = 0;
+		deactivate_out:
 			kvm_gpc_deactivate(vcpu->kvm,
 					   &vcpu->arch.xen.runstate_cache);
-			r = 0;
+			kvm_gpc_deactivate(vcpu->kvm,
+					   &vcpu->arch.xen.runstate2_cache);
 			break;
 		}
 
+		/*
+		 * If the guest switches to 64-bit mode after setting the runstate
+		 * address, that's actually OK. kvm_xen_update_runstate_guest()
+		 * will cope.
+		 */
+		if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode)
+			sz = sizeof(struct vcpu_runstate_info);
+		else
+			sz = sizeof(struct compat_vcpu_runstate_info);
+
+		/* How much fits in the (first) page? */
+		sz1 = PAGE_SIZE - (data->u.gpa & ~PAGE_MASK);
 		r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate_cache,
-				     NULL, KVM_HOST_USES_PFN, data->u.gpa,
-				     sizeof(struct vcpu_runstate_info));
-		break;
+				     NULL, KVM_HOST_USES_PFN, data->u.gpa, sz1);
+		if (r)
+			goto deactivate_out;
 
+		/* Either map the second page, or deactivate the second GPC */
+		if (sz1 > sz) {
+			kvm_gpc_deactivate(vcpu->kvm,
+					   &vcpu->arch.xen.runstate2_cache);
+		} else {
+			sz2 = sz - sz1;
+			BUG_ON((data->u.gpa + sz1) & ~PAGE_MASK);
+			r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache,
+					     NULL, KVM_HOST_USES_PFN,
+					     data->u.gpa + sz1, sz2);
+			if (r)
+				goto deactivate_out;
+		}
+
+		kvm_xen_update_runstate_guest(vcpu, false);
+		break;
+	}
 	case KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT:
 		if (!sched_info_on()) {
 			r = -EOPNOTSUPP;
@@ -1834,6 +1989,7 @@ void kvm_xen_init_vcpu(struct kvm_vcpu *vcpu)
 	timer_setup(&vcpu->arch.xen.poll_timer, cancel_evtchn_poll, 0);
 
 	kvm_gpc_init(&vcpu->arch.xen.runstate_cache);
+	kvm_gpc_init(&vcpu->arch.xen.runstate2_cache);
 	kvm_gpc_init(&vcpu->arch.xen.vcpu_info_cache);
 	kvm_gpc_init(&vcpu->arch.xen.vcpu_time_info_cache);
 }
@@ -1844,6 +2000,7 @@ void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 		kvm_xen_stop_timer(vcpu);
 
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.runstate_cache);
+	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache);
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.vcpu_info_cache);
 	kvm_gpc_deactivate(vcpu->kvm, &vcpu->arch.xen.vcpu_time_info_cache);
 
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index 532a535a9e99..8503d2c6891e 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -143,11 +143,11 @@ int kvm_xen_hypercall(struct kvm_vcpu *vcpu);
 #include <asm/xen/interface.h>
 #include <xen/interface/vcpu.h>
 
-void kvm_xen_update_runstate_guest(struct kvm_vcpu *vcpu, int state);
+void kvm_xen_update_runstate(struct kvm_vcpu *vcpu, int state);
 
 static inline void kvm_xen_runstate_set_running(struct kvm_vcpu *vcpu)
 {
-	kvm_xen_update_runstate_guest(vcpu, RUNSTATE_running);
+	kvm_xen_update_runstate(vcpu, RUNSTATE_running);
 }
 
 static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
@@ -162,7 +162,7 @@ static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(!vcpu->preempted))
 		return;
 
-	kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
+	kvm_xen_update_runstate(vcpu, RUNSTATE_runnable);
 }
 
 /* 32-bit compatibility definitions, also used natively in 32-bit build */
diff --git a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
index 2a5727188c8d..7f39815f1772 100644
--- a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
+++ b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
@@ -26,17 +26,17 @@
 #define SHINFO_REGION_GPA	0xc0000000ULL
 #define SHINFO_REGION_SLOT	10
 
-#define DUMMY_REGION_GPA	(SHINFO_REGION_GPA + (2 * PAGE_SIZE))
+#define DUMMY_REGION_GPA	(SHINFO_REGION_GPA + (3 * PAGE_SIZE))
 #define DUMMY_REGION_SLOT	11
 
 #define SHINFO_ADDR	(SHINFO_REGION_GPA)
-#define PVTIME_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE)
-#define RUNSTATE_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE + 0x20)
 #define VCPU_INFO_ADDR	(SHINFO_REGION_GPA + 0x40)
+#define PVTIME_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE)
+#define RUNSTATE_ADDR	(SHINFO_REGION_GPA + PAGE_SIZE + PAGE_SIZE - 15)
 
 #define SHINFO_VADDR	(SHINFO_REGION_GVA)
-#define RUNSTATE_VADDR	(SHINFO_REGION_GVA + PAGE_SIZE + 0x20)
 #define VCPU_INFO_VADDR	(SHINFO_REGION_GVA + 0x40)
+#define RUNSTATE_VADDR	(SHINFO_REGION_GVA + PAGE_SIZE + PAGE_SIZE - 15)
 
 #define EVTCHN_VECTOR	0x10
 
@@ -449,8 +449,8 @@ int main(int argc, char *argv[])
 
 	/* Map a region for the shared_info page */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
-				    SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 2, 0);
-	virt_map(vm, SHINFO_REGION_GVA, SHINFO_REGION_GPA, 2);
+				    SHINFO_REGION_GPA, SHINFO_REGION_SLOT, 3, 0);
+	virt_map(vm, SHINFO_REGION_GVA, SHINFO_REGION_GPA, 3);
 
 	struct shared_info *shinfo = addr_gpa2hva(vm, SHINFO_VADDR);
 
-- 
2.35.3



* [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-19  9:46 [PATCH 1/4] MAINTAINERS: Add KVM x86/xen maintainer list David Woodhouse
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
@ 2022-11-19  9:46 ` David Woodhouse
  2022-11-19 11:40   ` Durrant, Paul
  2022-11-19  9:46 ` [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary David Woodhouse
  2 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-19  9:46 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

From: David Woodhouse <dwmw@amazon.co.uk>

In the case where a GPC is refreshed to a different location within the
same page, we didn't bother to update it. Mostly we don't need to, but
since the ->khva field also includes the offset within the page, that
does have to be updated.
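
As a minimal sketch of the fix (page_offset here being the offset of the new
GPA within the page): if the cache moves from offset 0x20 to offset 0x40 of
the same page, the pfn and page mapping are unchanged but the usable kernel
address is not:

    /* old_khva is the page-aligned kernel mapping, which hasn't changed */
    unsigned long page_offset = gpc->gpa & ~PAGE_MASK;

    gpc->khva = old_khva + page_offset;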

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 virt/kvm/pfncache.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index bd4a46aee384..5f83321bfd2a 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -297,7 +297,12 @@ int kvm_gfn_to_pfn_cache_refresh(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
 	if (!gpc->valid || old_uhva != gpc->uhva) {
 		ret = hva_to_pfn_retry(kvm, gpc);
 	} else {
-		/* If the HVA→PFN mapping was already valid, don't unmap it. */
+		/*
+		 * If the HVA→PFN mapping was already valid, don't unmap it.
+		 * But do update gpc->khva because the offset within the page
+		 * may have changed.
+		 */
+		gpc->khva = old_khva + page_offset;
 		old_pfn = KVM_PFN_ERR_FAULT;
 		old_khva = NULL;
 		ret = 0;
-- 
2.35.3



* [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
  2022-11-19  9:46 [PATCH 1/4] MAINTAINERS: Add KVM x86/xen maintainer list David Woodhouse
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
  2022-11-19  9:46 ` [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page David Woodhouse
@ 2022-11-19  9:46 ` David Woodhouse
  2022-11-19 11:42   ` Durrant, Paul
  2 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-19  9:46 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

From: David Woodhouse <dwmw@amazon.co.uk>

Torture test the cases where the runstate crosses a page boundary, and
especially the case where it's configured in 32-bit mode and doesn't,
but then switching to 64-bit mode makes it go onto the second page.

To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl
also update the guest runstate area. It already did so if the actual
runstate changed, as a side-effect of kvm_xen_update_runstate(). So
doing it in the plain adjustment case is making it more consistent, as
well as giving us a nice way to trigger the update without actually
running the vCPU again and changing the values.
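
For example, the selftest below triggers a rewrite of the runstate area
without changing any of the values like this:

    struct kvm_xen_vcpu_attr adj = {
        .type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST,
        .u.runstate.state = (uint64_t)-1,  /* no state change, just rewrite */
    };
    vcpu_ioctl(vcpu, KVM_XEN_VCPU_SET_ATTR, &adj);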

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c                            |   2 +
 .../selftests/kvm/x86_64/xen_shinfo_test.c    | 115 +++++++++++++++---
 2 files changed, 97 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8aa953b1f0e0..747dc347c70e 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -848,6 +848,8 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 
 		if (data->u.runstate.state <= RUNSTATE_offline)
 			kvm_xen_update_runstate(vcpu, data->u.runstate.state);
+		else if (vcpu->arch.xen.runstate_cache.active)
+			kvm_xen_update_runstate_guest(vcpu, false);
 		r = 0;
 		break;
 
diff --git a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
index 7f39815f1772..1f4fd97db959 100644
--- a/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
+++ b/tools/testing/selftests/kvm/x86_64/xen_shinfo_test.c
@@ -88,14 +88,20 @@ struct pvclock_wall_clock {
 } __attribute__((__packed__));
 
 struct vcpu_runstate_info {
-    uint32_t state;
-    uint64_t state_entry_time;
-    uint64_t time[4];
+	uint32_t state;
+	uint64_t state_entry_time;
+	uint64_t time[5]; /* Extra field for overrun check */
 };
 
+struct compat_vcpu_runstate_info {
+	uint32_t state;
+	uint64_t state_entry_time;
+	uint64_t time[5];
+} __attribute__((__packed__));
+
 struct arch_vcpu_info {
-    unsigned long cr2;
-    unsigned long pad; /* sizeof(vcpu_info_t) == 64 */
+	unsigned long cr2;
+	unsigned long pad; /* sizeof(vcpu_info_t) == 64 */
 };
 
 struct vcpu_info {
@@ -999,22 +1005,91 @@ int main(int argc, char *argv[])
 				       runstate_names[i], rs->time[i]);
 			}
 		}
-		TEST_ASSERT(rs->state == rst.u.runstate.state, "Runstate mismatch");
-		TEST_ASSERT(rs->state_entry_time == rst.u.runstate.state_entry_time,
-			    "State entry time mismatch");
-		TEST_ASSERT(rs->time[RUNSTATE_running] == rst.u.runstate.time_running,
-			    "Running time mismatch");
-		TEST_ASSERT(rs->time[RUNSTATE_runnable] == rst.u.runstate.time_runnable,
-			    "Runnable time mismatch");
-		TEST_ASSERT(rs->time[RUNSTATE_blocked] == rst.u.runstate.time_blocked,
-			    "Blocked time mismatch");
-		TEST_ASSERT(rs->time[RUNSTATE_offline] == rst.u.runstate.time_offline,
-			    "Offline time mismatch");
-
-		TEST_ASSERT(rs->state_entry_time == rs->time[0] +
-			    rs->time[1] + rs->time[2] + rs->time[3],
-			    "runstate times don't add up");
+
+		/*
+		 * Exercise runstate info at all points across the page boundary, in
+		 * 32-bit and 64-bit mode. In particular, test the case where it is
+		 * configured in 32-bit mode and then switched to 64-bit mode while
+		 * active, which takes it onto the second page.
+		 */
+		unsigned long runstate_addr;
+		struct compat_vcpu_runstate_info *crs;
+		for (runstate_addr = SHINFO_REGION_GPA + PAGE_SIZE + PAGE_SIZE - sizeof(*rs) - 4;
+		     runstate_addr < SHINFO_REGION_GPA + PAGE_SIZE + PAGE_SIZE + 4; runstate_addr++) {
+
+			rs = addr_gpa2hva(vm, runstate_addr);
+			crs = (void *)rs;
+
+			memset(rs, 0xa5, sizeof(*rs));
+
+			/* Set to compatibility mode */
+			lm.u.long_mode = 0;
+			vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &lm);
+
+			/* Set runstate to new address (kernel will write it) */
+			struct kvm_xen_vcpu_attr st = {
+				.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR,
+				.u.gpa = runstate_addr,
+			};
+			vcpu_ioctl(vcpu, KVM_XEN_VCPU_SET_ATTR, &st);
+
+			if (verbose)
+				printf("Compatibility runstate at %08lx\n", runstate_addr);
+
+			TEST_ASSERT(crs->state == rst.u.runstate.state, "Runstate mismatch");
+			TEST_ASSERT(crs->state_entry_time == rst.u.runstate.state_entry_time,
+				    "State entry time mismatch");
+			TEST_ASSERT(crs->time[RUNSTATE_running] == rst.u.runstate.time_running,
+				    "Running time mismatch");
+			TEST_ASSERT(crs->time[RUNSTATE_runnable] == rst.u.runstate.time_runnable,
+				    "Runnable time mismatch");
+			TEST_ASSERT(crs->time[RUNSTATE_blocked] == rst.u.runstate.time_blocked,
+				    "Blocked time mismatch");
+			TEST_ASSERT(crs->time[RUNSTATE_offline] == rst.u.runstate.time_offline,
+				    "Offline time mismatch");
+			TEST_ASSERT(crs->time[RUNSTATE_offline + 1] == 0xa5a5a5a5a5a5a5a5ULL,
+				    "Structure overrun");
+			TEST_ASSERT(crs->state_entry_time == crs->time[0] +
+				    crs->time[1] + crs->time[2] + crs->time[3],
+				    "runstate times don't add up");
+
+
+			/* Now switch to 64-bit mode */
+			lm.u.long_mode = 1;
+			vm_ioctl(vm, KVM_XEN_HVM_SET_ATTR, &lm);
+
+			memset(rs, 0xa5, sizeof(*rs));
+
+			/* Don't change the address, just trigger a write */
+			struct kvm_xen_vcpu_attr adj = {
+				.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST,
+				.u.runstate.state = (uint64_t)-1
+			};
+			vcpu_ioctl(vcpu, KVM_XEN_VCPU_SET_ATTR, &adj);
+
+			if (verbose)
+				printf("64-bit runstate at %08lx\n", runstate_addr);
+
+			TEST_ASSERT(rs->state == rst.u.runstate.state, "Runstate mismatch");
+			TEST_ASSERT(rs->state_entry_time == rst.u.runstate.state_entry_time,
+				    "State entry time mismatch");
+			TEST_ASSERT(rs->time[RUNSTATE_running] == rst.u.runstate.time_running,
+				    "Running time mismatch");
+			TEST_ASSERT(rs->time[RUNSTATE_runnable] == rst.u.runstate.time_runnable,
+				    "Runnable time mismatch");
+			TEST_ASSERT(rs->time[RUNSTATE_blocked] == rst.u.runstate.time_blocked,
+				    "Blocked time mismatch");
+			TEST_ASSERT(rs->time[RUNSTATE_offline] == rst.u.runstate.time_offline,
+				    "Offline time mismatch");
+			TEST_ASSERT(rs->time[RUNSTATE_offline + 1] == 0xa5a5a5a5a5a5a5a5ULL,
+				    "Structure overrun");
+
+			TEST_ASSERT(rs->state_entry_time == rs->time[0] +
+				    rs->time[1] + rs->time[2] + rs->time[3],
+				    "runstate times don't add up");
+		}
 	}
+
 	kvm_vm_free(vm);
 	return 0;
 }
-- 
2.35.3



* RE: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
@ 2022-11-19 11:38   ` Durrant, Paul
  2022-11-19 11:46     ` David Woodhouse
  2022-11-22 18:39   ` Sean Christopherson
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Durrant, Paul @ 2022-11-19 11:38 UTC (permalink / raw)
  To: David Woodhouse, Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

> -----Original Message-----
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: 19 November 2022 09:47
> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; mhal@rbox.co
> Subject: [EXTERNAL] [PATCH 2/4] KVM: x86/xen: Compatibility fixes for
> shared runstate area
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The guest runstate area can be arbitrarily byte-aligned. In fact, even
> when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> fields in the structure end up being unaligned due to the fact that the
> 32-bit ABI only aligns them to 32 bits.
> 
> So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> is buggy, because if it's unaligned then we can't update the whole field
> atomically; the low bytes might be observable before the _UPDATE bit is.
> Xen actually updates the *byte* containing that top bit, on its own. KVM
> should do the same.
> 
> In addition, we cannot assume that the runstate area fits within a single
> page. One option might be to make the gfn_to_pfn cache cope with regions
> that cross a page — but getting a contiguous virtual kernel mapping of a
> discontiguous set of IOMEM pages is a distinctly non-trivial exercise, and
> it seems this is the *only* current use case for the GPC which would
> benefit from it.
> 
> An earlier version of the runstate code did use a gfn_to_hva cache for
> this purpose, but it still had the single-page restriction because it used
> the uhva directly — because it needs to be able to do so atomically when
> the vCPU is being scheduled out, so it used pagefault_disable() around the
> accesses and didn't just use kvm_write_guest_cached() which has a fallback
> path.
> 
> So... use a pair of GPCs for the first and potential second page covering
> the runstate area. We can get away with locking both at once because
> nothing else takes more than one GPC lock at a time so we can invent a
> trivial ordering rule.
> 
> Keep the trivial fast path for the common case where it's all in the same
> page, but fixed to use a byte access for the XEN_RUNSTATE_UPDATE bit. And
> in the cross-page case, build the structure locally on the stack and then
> copy it over with two memcpy() calls, again handling the
> XEN_RUNSTATE_UPDATE bit through a single byte access.
> 
> Finally, Xen also does write the runstate area immediately when it's
> configured. Flip the kvm_xen_update_runstate() and …_guest() functions and
> call the latter directly when the runstate area is set. This means that
> other ioctls which modify the runstate also write it immediately to the
> guest when they do so, which is also intended.
> 
> Update the xen_shinfo_test to exercise the pathological case where the
> XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
> actually in a different page to the rest of the 64-bit word.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/kvm_host.h               |   1 +
>  arch/x86/kvm/xen.c                            | 367 +++++++++++++-----
>  arch/x86/kvm/xen.h                            |   6 +-
>  .../selftests/kvm/x86_64/xen_shinfo_test.c    |  12 +-
>  4 files changed, 272 insertions(+), 114 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h index d1013c4f673c..70af7240a1d5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -686,6 +686,7 @@ struct kvm_vcpu_xen {
>         struct gfn_to_pfn_cache vcpu_info_cache;
>         struct gfn_to_pfn_cache vcpu_time_info_cache;
>         struct gfn_to_pfn_cache runstate_cache;
> +       struct gfn_to_pfn_cache runstate2_cache;
>         u64 last_steal;
>         u64 runstate_entry_time;
>         u64 runstate_times[4];
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c index
> 4b8e9628fbf5..8aa953b1f0e0 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -170,148 +170,269 @@ static void kvm_xen_init_timer(struct kvm_vcpu
> *vcpu)
>         vcpu->arch.xen.timer.function = xen_timer_callback;  }
> 
> -static void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
> +static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool
> +atomic)
>  {
>         struct kvm_vcpu_xen *vx = &v->arch.xen;
> -       u64 now = get_kvmclock_ns(v->kvm);
> -       u64 delta_ns = now - vx->runstate_entry_time;
> -       u64 run_delay = current->sched_info.run_delay;
> +       struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
> +       struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
> +       size_t user_len, user_len1, user_len2;

I think it might make the code marginally neater to use two-element arrays... but only marginally.
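
e.g. something like this (purely illustrative, not tested):

    struct gfn_to_pfn_cache *gpc[2] = { &vx->runstate_cache,
                                        &vx->runstate2_cache };
    size_t user_len[2];
    int i;

    /* ... split the total length across the two pages as before ... */

    for (i = 0; i < 2 && user_len[i]; i++)
        mark_page_dirty_in_slot(v->kvm, gpc[i]->memslot,
                                gpc[i]->gpa >> PAGE_SHIFT);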

> +       struct vcpu_runstate_info rs;
> +       int *rs_state = &rs.state;
> +       unsigned long flags;
> +       size_t times_ofs;
> +       u8 *update_bit;
> 
> -       if (unlikely(!vx->runstate_entry_time))
> -               vx->current_runstate = RUNSTATE_offline;
> +       /*
> +        * The only difference between 32-bit and 64-bit versions of the
> +        * runstate struct us the alignment of uint64_t in 32-bit, which
> +        * means that the 64-bit version has an additional 4 bytes of
> +        * padding after the first field 'state'.
> +        */
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
> +       BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) !=
> 0);
> +       BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
> +#ifdef CONFIG_X86_64
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time)
> !=
> +                    offsetof(struct compat_vcpu_runstate_info,
> state_entry_time) + 4);
> +       BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
> +                    offsetof(struct compat_vcpu_runstate_info, time) +
> +4); #endif
> +
> +       if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
> +               user_len = sizeof(struct vcpu_runstate_info);
> +               times_ofs = offsetof(struct vcpu_runstate_info,
> +                                    state_entry_time);
> +       } else {
> +               user_len = sizeof(struct compat_vcpu_runstate_info);
> +               times_ofs = offsetof(struct compat_vcpu_runstate_info,
> +                                    state_entry_time);
> +               rs_state++;
> +       }
> 
>         /*
> -        * Time waiting for the scheduler isn't "stolen" if the
> -        * vCPU wasn't running anyway.
> +        * There are basically no alignment constraints. The guest can set
> it
> +        * up so it crosses from one page to the next, and at arbitrary
> byte
> +        * alignment (and the 32-bit ABI doesn't align the 64-bit integers
> +        * anyway, even if the overall struct had been 64-bit aligned).
>          */
> -       if (vx->current_runstate == RUNSTATE_running) {
> -               u64 steal_ns = run_delay - vx->last_steal;
> +       if ((gpc1->gpa & ~PAGE_MASK) + user_len >= PAGE_SIZE) {
> +               user_len1 = PAGE_SIZE - (gpc1->gpa & ~PAGE_MASK);
> +               user_len2 = user_len - user_len1;
> +       } else {
> +               user_len1 = user_len;
> +               user_len2 = 0;
> +       }
> +       BUG_ON(user_len1 + user_len2 != user_len);
> 
> -               delta_ns -= steal_ns;
> + retry:
> +       /*
> +        * Attempt to obtain the GPC lock on *both* (if there are two)
> +        * gfn_to_pfn caches that cover the region.
> +        */
> +       read_lock_irqsave(&gpc1->lock, flags);
> +       while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa,
> user_len1)) {
> +               read_unlock_irqrestore(&gpc1->lock, flags);
> 
> -               vx->runstate_times[RUNSTATE_runnable] += steal_ns;
> +               /* When invoked from kvm_sched_out() we cannot sleep */
> +               if (atomic)
> +                       return;
> +
> +               if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa,
> user_len1))
> +                       return;
> +
> +               read_lock_irqsave(&gpc1->lock, flags);

This is a repeated pattern now, so how about a helper function?
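
Something along these lines, perhaps (just a sketch; the name and signature
are made up):

    /* Lock a GPC, refreshing it if necessary; bail out if that would sleep */
    static bool runstate_lock_gpc(struct kvm *kvm, struct gfn_to_pfn_cache *gpc,
                                  unsigned long len, bool atomic,
                                  unsigned long *flags)
    {
        read_lock_irqsave(&gpc->lock, *flags);
        while (!kvm_gfn_to_pfn_cache_check(kvm, gpc, gpc->gpa, len)) {
            read_unlock_irqrestore(&gpc->lock, *flags);

            if (atomic || kvm_gfn_to_pfn_cache_refresh(kvm, gpc, gpc->gpa, len))
                return false;

            read_lock_irqsave(&gpc->lock, *flags);
        }
        return true;
    }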

  Paul


* RE: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-19  9:46 ` [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page David Woodhouse
@ 2022-11-19 11:40   ` Durrant, Paul
  2022-11-22 16:23     ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: Durrant, Paul @ 2022-11-19 11:40 UTC (permalink / raw)
  To: David Woodhouse, Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

> -----Original Message-----
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: 19 November 2022 09:47
> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; mhal@rbox.co
> Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
> moves within the same page
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> In the case where a GPC is refreshed to a different location within the
> same page, we didn't bother to update it. Mostly we don't need to, but
> since the ->khva field also includes the offset within the page, that does
> have to be updated.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Reviewed-by: <paul@xen.org>


* RE: [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
  2022-11-19  9:46 ` [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary David Woodhouse
@ 2022-11-19 11:42   ` Durrant, Paul
  0 siblings, 0 replies; 32+ messages in thread
From: Durrant, Paul @ 2022-11-19 11:42 UTC (permalink / raw)
  To: David Woodhouse, Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal

> -----Original Message-----
> From: David Woodhouse <dwmw2@infradead.org>
> Sent: 19 November 2022 09:47
> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> <seanjc@google.com>
> Cc: kvm@vger.kernel.org; mhal@rbox.co
> Subject: [EXTERNAL] [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-
> bit mode and crossing page boundary
> 
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Torture test the cases where the runstate crosses a page boundary, and
> especially the case where it's configured in 32-bit mode and doesn't, but
> then switching to 64-bit mode makes it go onto the second page.
> 
> To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl
> also update the guest runstate area. It already did so if the actual
> runstate changed, as a side-effect of kvm_xen_update_runstate(). So doing
> it in the plain adjustment case is making it more consistent, as well as
> giving us a nice way to trigger the update without actually running the
> vCPU again and changing the values.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

Reviewed-by: Paul Durrant <paul@xen.org>


* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19 11:38   ` Durrant, Paul
@ 2022-11-19 11:46     ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-19 11:46 UTC (permalink / raw)
  To: Durrant, Paul, Paolo Bonzini, Sean Christopherson; +Cc: kvm, mhal


On Sat, 2022-11-19 at 11:38 +0000, Durrant, Paul wrote:
> > +       size_t user_len, user_len1, user_len2;
> 
> I think it might make the code marginally neater to use two-element arrays... but only marginally.

Not sure; I kind of prefer it like this. At least overall, this version
is somewhat nicer than some of the interim attempts that you saw... :)


> > +        * Attempt to obtain the GPC lock on *both* (if there are two)
> > +        * gfn_to_pfn caches that cover the region.
> > +        */
> > +       read_lock_irqsave(&gpc1->lock, flags);
> > +       while (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc1, gpc1->gpa, user_len1)) {
> > +               read_unlock_irqrestore(&gpc1->lock, flags);
> > 
> > -               vx->runstate_times[RUNSTATE_runnable] += steal_ns;
> > +               /* When invoked from kvm_sched_out() we cannot sleep */
> > +               if (atomic)
> > +                       return;
> > +
> > +               if (kvm_gfn_to_pfn_cache_refresh(v->kvm, gpc1, gpc1->gpa, user_len1))
> > +                       return;
> > +
> > +               read_lock_irqsave(&gpc1->lock, flags);
> 
> This is a repeated pattern now, so how about a helper function?

I've been tempted by that a few times but although it's a repeating
pattern, there are differences every time around the conditions for
bailing out early — the if (atomic) part above.






* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-19 11:40   ` Durrant, Paul
@ 2022-11-22 16:23     ` Sean Christopherson
  2022-11-22 16:49       ` Paul Durrant
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-22 16:23 UTC (permalink / raw)
  To: Durrant, Paul; +Cc: David Woodhouse, Paolo Bonzini, kvm, mhal

On Sat, Nov 19, 2022, Durrant, Paul wrote:
> > -----Original Message-----
> > From: David Woodhouse <dwmw2@infradead.org>
> > Sent: 19 November 2022 09:47
> > To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
> > <seanjc@google.com>
> > Cc: kvm@vger.kernel.org; mhal@rbox.co
> > Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
> > moves within the same page

Please use a mail client that doesn't include the header gunk in the reply.

> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > In the case where a GPC is refreshed to a different location within the
> > same page, we didn't bother to update it. Mostly we don't need to, but
> > since the ->khva field also includes the offset within the page, that does
> > have to be updated.
> > 
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Reviewed-by: <paul@xen.org>

Tags need your real name, not just your email address, i.e. this should be:

  Reviewed-by: Paul Durrant <paul@xen.org>


* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-22 16:23     ` Sean Christopherson
@ 2022-11-22 16:49       ` Paul Durrant
  2022-11-22 17:02         ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Paul Durrant @ 2022-11-22 16:49 UTC (permalink / raw)
  To: Sean Christopherson, Durrant, Paul
  Cc: David Woodhouse, Paolo Bonzini, kvm, mhal

On 22/11/2022 16:23, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, Durrant, Paul wrote:
>>> -----Original Message-----
>>> From: David Woodhouse <dwmw2@infradead.org>
>>> Sent: 19 November 2022 09:47
>>> To: Paolo Bonzini <pbonzini@redhat.com>; Sean Christopherson
>>> <seanjc@google.com>
>>> Cc: kvm@vger.kernel.org; mhal@rbox.co
>>> Subject: [EXTERNAL] [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it
>>> moves within the same page
> 
> Please use a mail client that doesn't include the header gunk in the reply.
> 

Sorry about that; it's not just a change of client unfortunately but I 
should be avoiding the problem now...

>>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Reviewed-by: <paul@xen.org>
> 
> Tags need your real name, not just your email address, i.e. this should be:
> 
>    Reviewed-by: Paul Durrant <paul@xen.org>

Yes indeed it should. Don't know how I managed to screw that up... It's 
not like I haven't typed that properly hundreds of times on Xen patch reviews.

   Paul



* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-22 16:49       ` Paul Durrant
@ 2022-11-22 17:02         ` David Woodhouse
  2022-11-22 19:09           ` Sean Christopherson
  2022-11-24  0:31           ` Paolo Bonzini
  0 siblings, 2 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-22 17:02 UTC (permalink / raw)
  To: Paul Durrant, Sean Christopherson, Durrant, Paul; +Cc: Paolo Bonzini, kvm, mhal


On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > Tags need your real name, not just your email address, i.e. this should be:
> >     Reviewed-by: Paul Durrant <paul@xen.org>
> 
> Yes indeed it should. Don't know how I managed to screw that up... It's 
> not like haven't type that properly hundreds of times on Xen patch reviews.

All sorted in the tree I'm curating
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes

Of those, three are marked as Cc:stable and want to go into the 6.1 release:

      KVM: x86/xen: Validate port number in SCHEDOP_poll
      KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
      KVM: Update gfn_to_pfn_cache khva when it moves within the same page

The rest (including the runstate compatibility fixes) are less
critical.

Sean, given that this now includes your patch series which in turn you
took over from Michal, how would you prefer me to proceed?

David Woodhouse (7):
      KVM: x86/xen: Validate port number in SCHEDOP_poll
      KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
      KVM: x86/xen: Add CPL to Xen hypercall tracepoint
      MAINTAINERS: Add KVM x86/xen maintainer list
      KVM: x86/xen: Compatibility fixes for shared runstate area
      KVM: Update gfn_to_pfn_cache khva when it moves within the same page
      KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary

Michal Luczaj (6):
      KVM: Shorten gfn_to_pfn_cache function names
      KVM: x86: Remove unused argument in gpc_unmap_khva()
      KVM: Store immutable gfn_to_pfn_cache properties
      KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
      KVM: Clean up hva_to_pfn_retry()
      KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()

Sean Christopherson (4):
      KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
      KVM: Do not partially reinitialize gfn=>pfn cache during activation
      KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
      KVM: Skip unnecessary "unmap" if gpc is already valid during refresh

We can reinstate the 'immutable length' thing on top, if we pick one of
the discussed options for coping with the fact that for the runstate
area, it *isn't* immutable. I'm slightly leaning towards just setting
the length to '1' despite it being a lie.



* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
  2022-11-19 11:38   ` Durrant, Paul
@ 2022-11-22 18:39   ` Sean Christopherson
  2022-11-22 23:04     ` David Woodhouse
  2022-11-22 23:46   ` Michal Luczaj
  2022-11-23 18:21   ` Sean Christopherson
  3 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-22 18:39 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Sat, Nov 19, 2022, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The guest runstate area can be arbitrarily byte-aligned. In fact, even
> when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> fields in the structure end up being unaligned due to the fact that the
> 32-bit ABI only aligns them to 32 bits.
> 
> So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> is buggy, because if it's unaligned then we can't update the whole field
> atomically; the low bytes might be observable before the _UPDATE bit is.
> Xen actually updates the *byte* containing that top bit, on its own. KVM
> should do the same.

I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
_need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
other hand most definitely doesn't need to know the pfn.

The event channel code would be difficult to convert due to lack of uaccess
primitives, but I don't see anything in the runstate code that prevents KVM from
using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
down a slow path and would yield far simpler code.

If taking the slow path is an issue, then the guest really should be fixed to not
split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
could be updated to use the fast path if the region is contiguous in the host
virtual address space.
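
Roughly this shape, purely illustrative (runstate_ghc would be a new
gfn_to_hva_cache field; rs, times_ofs and user_len as in the patch; and the
obvious caveat is that these writes can fault, which is the atomic-context
problem the changelog describes):

    u8 update = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;

    /* Set the UPDATE flag byte, write the body, then clear the flag */
    kvm_write_guest_offset_cached(v->kvm, &vx->runstate_ghc, &update,
                                  times_ofs + sizeof(u64) - 1, 1);
    smp_wmb();
    kvm_write_guest_offset_cached(v->kvm, &vx->runstate_ghc, &rs, 0, user_len);
    smp_wmb();
    update = vx->runstate_entry_time >> 56;
    kvm_write_guest_offset_cached(v->kvm, &vx->runstate_ghc, &update,
                                  times_ofs + sizeof(u64) - 1, 1);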

> +static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
>  {
>  	struct kvm_vcpu_xen *vx = &v->arch.xen;
> -	u64 now = get_kvmclock_ns(v->kvm);
> -	u64 delta_ns = now - vx->runstate_entry_time;
> -	u64 run_delay = current->sched_info.run_delay;
> +	struct gfn_to_pfn_cache *gpc1 = &vx->runstate_cache;
> +	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
> +	size_t user_len, user_len1, user_len2;
> +	struct vcpu_runstate_info rs;
> +	int *rs_state = &rs.state;
> +	unsigned long flags;
> +	size_t times_ofs;
> +	u8 *update_bit;
>  
> -	if (unlikely(!vx->runstate_entry_time))
> -		vx->current_runstate = RUNSTATE_offline;
> +	/*
> +	 * The only difference between 32-bit and 64-bit versions of the
> +	 * runstate struct us the alignment of uint64_t in 32-bit, which

s/us/is

> +	 * means that the 64-bit version has an additional 4 bytes of
> +	 * padding after the first field 'state'.
> +	 */
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
> +	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
> +	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
> +#ifdef CONFIG_X86_64
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
> +		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
> +	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
> +		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
> +#endif
> +
> +	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
> +		user_len = sizeof(struct vcpu_runstate_info);
> +		times_ofs = offsetof(struct vcpu_runstate_info,
> +				     state_entry_time);
> +	} else {
> +		user_len = sizeof(struct compat_vcpu_runstate_info);
> +		times_ofs = offsetof(struct compat_vcpu_runstate_info,
> +				     state_entry_time);
> +		rs_state++;

...

> +	*rs_state = vx->current_runstate;
> +#ifdef CONFIG_X86_64
> +	/* Don't leak kernel memory through the padding in the 64-bit struct */
> +	if (rs_state == &rs.state)
> +		rs_state[1] = 0;

Oof, that's difficult to follow.  Rather than pointer magic, what about zeroing
the first word unconditionally?  Likely faster than a CMP+CMOV or whatever gets
generated.  Or just zero the entire struct.

	/* Blah blah blah */
	*(unsigned long *)&rs.state = 0;

> +#endif

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-22 17:02         ` David Woodhouse
@ 2022-11-22 19:09           ` Sean Christopherson
  2022-11-22 23:13             ` David Woodhouse
  2022-11-24  0:31           ` Paolo Bonzini
  1 sibling, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-22 19:09 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paul Durrant, Durrant, Paul, Paolo Bonzini, kvm, mhal

On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > Tags need your real name, not just your email address, i.e. this should be:
> > >     Reviewed-by: Paul Durrant <paul@xen.org>
> > 
> > Yes indeed it should. Don't know how I managed to screw that up... It's 
> > not like haven't type that properly hundreds of times on Xen patch reviews.
> 
> All sorted in the tree I'm curating
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
> 
> Of those, three are marked as Cc:stable and want to go into the 6.1 release:
> 
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
> 
> The rest (including the runstate compatibility fixes) are less
> critical.
> 
> Sean, given that this now includes your patch series which in turn you
> took over from Michal, how would you prefer me to proceed?

If you're ok with a slight delay for fixing the runstate stuff, I think it makes
sense to push everything else out to 6.3.  At the very least, I want
to explore using a gfn_to_hva_cache instead of two gfn_to_pfn_cache structures
for the runstate, which at this point would be a bit rushed for 6.2.

Pushing out to 6.3 will also give us time to do the more aggressive cleanup of
adding kvm_gpc_lock(), which I think will yield APIs that both of us are happy with,
i.e. the gpa+len of the cache can be immutable, but users can still validate their
individual usage.

> David Woodhouse (7):
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: x86/xen: Add CPL to Xen hypercall tracepoint
>       MAINTAINERS: Add KVM x86/xen maintainer list
>       KVM: x86/xen: Compatibility fixes for shared runstate area
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>       KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
> 
> Michal Luczaj (6):
>       KVM: Shorten gfn_to_pfn_cache function names
>       KVM: x86: Remove unused argument in gpc_unmap_khva()
>       KVM: Store immutable gfn_to_pfn_cache properties
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
>       KVM: Clean up hva_to_pfn_retry()
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()
> 
> Sean Christopherson (4):
>       KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
>       KVM: Do not partially reinitialize gfn=>pfn cache during activation
>       KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
>       KVM: Skip unnecessary "unmap" if gpc is already valid during refresh
> 
> We can reinstate the 'immutable length' thing on top, if we pick one of
> the discussed options for coping with the fact that for the runstate
> area, it *isn't* immutable. I'm slightly leaning towards just setting
> the length to '1' despite it being a lie.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-22 18:39   ` Sean Christopherson
@ 2022-11-22 23:04     ` David Woodhouse
  2022-11-23 17:17       ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-22 23:04 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 4016 bytes --]

On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, David Woodhouse wrote:
> > From: David Woodhouse <
> > dwmw@amazon.co.uk
> > >
> > 
> > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > fields in the structure end up being unaligned due to the fact that the
> > 32-bit ABI only aligns them to 32 bits.
> > 
> > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > is buggy, because if it's unaligned then we can't update the whole field
> > atomically; the low bytes might be observable before the _UPDATE bit is.
> > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > should do the same.
> 
> I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
> _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
> other hand most definitely doesn't need to know the pfn.
> 
> The event channel code would be difficult to convert due to lack of uaccess
> primitives, but I don't see anything in the runstate code that prevents KVM from
> using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
> down a slow path and would yield far simpler code.
> 
> If taking the slow path is an issue, then the guest really should be fixed to not
> split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
> could be updated to use the fast path if the region is contiguous in the host
> virtual address space.
> 

Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
thing I tried. There are a handful of complexifying factors, none of
which are insurmountable if we try hard enough.

 • Even if we fix the gfn_to_hva_cache to still use the fast path for
   more than the first page, it's still possible for the runstate to
   cross from one memslot to an adjacent one. We probably still need
   two of them, which is a large part of the ugliness of this patch.

 • Accessing it via the userspace HVA requires coping with the case
   where the vCPU is being scheduled out, and it can't sleep. The
   previous code before commit 34d41d19e dealt with that by using 
   pagefault_disable() and depending on the GHC fast path. But even
   so...

 • We also could end up having to touch page B, page A then page B
   again, potentially having to abort and leave the runstate with
   the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
   version which checks that both pages are present before it starts
   writing anything to the guest.

As I said, they're not insurmountable but that's why I ended up with
the patch you see, after first attempting to use a gfn_to_hva_cache
again. Happy to entertain suggestions.

I note we have a kvm_read_guest_atomic() and briefly pondered adding a
'write' version of same... but that doesn't seem to cope with crossing
memslots either; it also assumes its caller only uses it to access
within a single page.

> > +	*rs_state = vx->current_runstate;
> > +#ifdef CONFIG_X86_64
> > +	/* Don't leak kernel memory through the padding in the 64-bit struct */
> > +	if (rs_state == &rs.state)
> > +		rs_state[1] = 0;
> 
> Oof, that's difficult to follow.  Rather than pointer magic, what about zeroing
> the first word unconditionally?  Likely faster than a CMP+CMOV or whatever gets
> generated.  Or just zero the entire struct.
> 
> 	/* Blah blah blah */
> 	*(unsigned long *)&rs.state = 0;

That gives us pointer aliasing problems, which I hate even if we *do*
compile with -fno-strict-aliasing. I'll use a memset which ought to end
up being precisely the same in practice.

#ifdef CONFIG_X86_64
		/*
		 * Don't leak kernel memory through the padding in the 64-bit
		 * struct if we take the split-page path.
		 */
		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
#endif

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-22 19:09           ` Sean Christopherson
@ 2022-11-22 23:13             ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-22 23:13 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paul Durrant, Durrant, Paul, Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 2313 bytes --]

On Tue, 2022-11-22 at 19:09 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, David Woodhouse wrote:
> > On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > > Tags need your real name, not just your email address, i.e. this should be:
> > > >     Reviewed-by: Paul Durrant <paul@xen.org>
> > > 
> > > Yes indeed it should. Don't know how I managed to screw that up... It's 
> > > not like haven't type that properly hundreds of times on Xen patch reviews.
> > 
> > All sorted in the tree I'm curating
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
> > 
> > 
> > Of those, three are marked as Cc:stable and want to go into the 6.1 release:
> > 
> >       KVM: x86/xen: Validate port number in SCHEDOP_poll
> >       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
> >       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
> > 
> > The rest (including the runstate compatibility fixes) are less
> > critical.
> > 
> > Sean, given that this now includes your patch series which in turn you
> > took over from Michal, how would you prefer me to proceed?
> 
> If you're ok with a slight delay for fixing the runstate stuff, I think it makes
> sense to push everything else out to 6.3. 

I can live with that. Will post those three fixes.

>  At the very least, I want to explore using a gfn_to_hva_cache
> instead of two gfn_to_pfn_cache structures for the runstate, which at
> this point would be a bit rushed for 6.2.

As noted, I think we'd need two gfn_to_hva_caches anyway. But happy to
explore options.

> Pushing out to 6.3 will also give us time to do the more aggressive cleanup of
> adding kvm_gpc_lock(), which I think will yield APIs that both of us are happy with,
> i.e. the gpa+len of the cache can be immutable, but users can still validate their
> individual usage.

Yeah. The kvm_gpc_lock() thing is another one where I just couldn't see
a clean one-size-fits-all solution, given the irqsave and not-irqsave
variants, and the different conditions (and other locks to drop) between
the check() failing and the refresh() occurring... but if we can make a
helper which covers the *common* case without it ending up being just an
obfuscation, that's a win.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
  2022-11-19 11:38   ` Durrant, Paul
  2022-11-22 18:39   ` Sean Christopherson
@ 2022-11-22 23:46   ` Michal Luczaj
  2022-11-23  0:03     ` David Woodhouse
  2022-11-23 18:21   ` Sean Christopherson
  3 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2022-11-22 23:46 UTC (permalink / raw)
  To: David Woodhouse, Paolo Bonzini, Sean Christopherson; +Cc: kvm

On 11/19/22 10:46, David Woodhouse wrote:
> @@ -584,23 +705,57 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> ...
> +		if (IS_ENABLED(CONFIG_64BIT) && vcpu->kvm->arch.xen.long_mode)
> +			sz = sizeof(struct vcpu_runstate_info);
> +		else
> +			sz = sizeof(struct compat_vcpu_runstate_info);
> +
> +		/* How much fits in the (first) page? */
> +		sz1 = PAGE_SIZE - (data->u.gpa & ~PAGE_MASK);
>  		r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate_cache,
> -				     NULL, KVM_HOST_USES_PFN, data->u.gpa,
> -				     sizeof(struct vcpu_runstate_info));
> -		break;
> +				     NULL, KVM_HOST_USES_PFN, data->u.gpa, sz1);
> +		if (r)
> +			goto deactivate_out;
>  
> +		/* Either map the second page, or deactivate the second GPC */
> +		if (sz1 > sz) {

Out of curiosity: is there a reason behind potentially
kvm_gpc_activate()ing a "len=0" cache? (sz1==sz leads to sz2=0)
I feel I may be missing something, but shouldn't the comparison be >=?

> +			kvm_gpc_deactivate(vcpu->kvm,
> +					   &vcpu->arch.xen.runstate2_cache);
> +		} else {
> +			sz2 = sz - sz1;
> +			BUG_ON((data->u.gpa + sz1) & ~PAGE_MASK);
> +			r = kvm_gpc_activate(vcpu->kvm, &vcpu->arch.xen.runstate2_cache,
> +					     NULL, KVM_HOST_USES_PFN,
> +					     data->u.gpa + sz1, sz2);
> +			if (r)
> +				goto deactivate_out;
> +		}
thanks,
Michal


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-22 23:46   ` Michal Luczaj
@ 2022-11-23  0:03     ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-23  0:03 UTC (permalink / raw)
  To: Michal Luczaj, Paolo Bonzini, Sean Christopherson; +Cc: kvm

[-- Attachment #1: Type: text/plain, Size: 430 bytes --]

On Wed, 2022-11-23 at 00:46 +0100, Michal Luczaj wrote:
> 
> > +             /* Either map the second page, or deactivate the second GPC */
> > +             if (sz1 > sz) {
> 
> Out of curiosity: is there a reason behind potentially
> kvm_gpc_activate()ing a "len=0" cache? (sz1==sz leads to sz2=0)
> I feel I may be missing something, but shouldn't the comparison be >=?

No good reason. Fixed in my tree; thanks.



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-22 23:04     ` David Woodhouse
@ 2022-11-23 17:17       ` Sean Christopherson
  2022-11-23 17:58         ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-23 17:17 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Tue, Nov 22, 2022, David Woodhouse wrote:
> On Tue, 2022-11-22 at 18:39 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> > > From: David Woodhouse <
> > > dwmw@amazon.co.uk
> > > >
> > > 
> > > The guest runstate area can be arbitrarily byte-aligned. In fact, even
> > > when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
> > > fields in the structure end up being unaligned due to the fact that the
> > > 32-bit ABI only aligns them to 32 bits.
> > > 
> > > So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
> > > is buggy, because if it's unaligned then we can't update the whole field
> > > atomically; the low bytes might be observable before the _UPDATE bit is.
> > > Xen actually updates the *byte* containing that top bit, on its own. KVM
> > > should do the same.
> > 
> > I think we're using the wrong APIs to update the runstate.  The VMCS/VMCB pages
> > _need_ the host pfn, i.e. need to use a gpc (eventually).  The Xen PV stuff on the
> > other hand most definitely doesn't need to know the pfn.
> > 
> > The event channel code would be difficult to convert due to lack of uaccess
> > primitives, but I don't see anything in the runstate code that prevents KVM from
> > using a gfn_to_hva_cache.  That will naturally handle page splits by sending them
> > down a slow path and would yield far simpler code.
> > 
> > If taking the slow path is an issue, then the guest really should be fixed to not
> > split pages.  And if that's not an acceptable answer, the gfn_to_hva_cache code
> > could be updated to use the fast path if the region is contiguous in the host
> > virtual address space.
> > 
> 
> Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> thing I tried. There are a handful of complexifying factors, none of
> which are insurmountable if we try hard enough.
> 
>  • Even if we fix the gfn_to_hva_cache to still use the fast path for
>    more than the first page, it's still possible for the runstate to
>    cross from one memslot to an adjacent one. We probably still need
>    two of them, which is a large part of the ugliness of this patch.

Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
break backwards compatibility since KVM doesn't even handle page splits, let alone
memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
contiguous anyways.

Probably a moot point since I don't see a way around the "page got swapped out"
issue.

>  • Accessing it via the userspace HVA requires coping with the case
>    where the vCPU is being scheduled out, and it can't sleep. The
>    previous code before commit 34d41d19e dealt with that by using 
>    pagefault_disable() and depending on the GHC fast path. But even
>    so...
> 
>  • We also could end up having to touch page B, page A then page B
>    again, potentially having to abort and leave the runstate with
>    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
>    version which checks that both pages are present before it starts
>    writing anything to the guest.

Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
without a page split, the backing page could get swapped out between setting and
clearing.

> As I said, they're not insurmountable but that's why I ended up with
> the patch you see, after first attempting to use a gfn_to_hva_cache
> again. Happy to entertain suggestions.

What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
avoid needing to acquire multiple gpc locks, and the update could use a single
code flow by always constructing an on-stack representation.  E.g. very roughly:

	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
	smp_wmb();

	if (khva)
		memcpy(khva, rs_state, user_len);
	else
		kvm_vcpu_write_guest_in_atomic(vcpu, gpa, rs_state, user_len);
	smp_wmb();

	*update_bit = vx->runstate_entry_time >> 56;
	smp_wmb();

where "khva" is NULL if there's a page split.

> I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> 'write' version of same... but that doesn't seem to cope with crossing
> memslots either; it also assumes its caller only uses it to access
> within a single page.

We should fix the lack of page split handling.  There are no bugs because KVM
only uses the helper for cases where the accesses are naturally aligned, but that's
bound to bite someone eventually.  That would be an opportunity to dedup the code
that handles "segments" (ugh), e.g. instead of

	gfn_t gfn = gpa >> PAGE_SHIFT;
	int seg;
	int offset = offset_in_page(gpa);
	int ret;

	while ((seg = next_segment(len, offset)) != 0) {
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret < 0)
			return ret;
		offset = 0;
		len -= seg;
		data += seg;
		++gfn;
	}
	return 0;

provide a macro that allows

	kvm_for_each_chunk(...)
		ret = kvm_vcpu_read_guest_page(vcpu, gfn, data, offset, seg);
		if (ret)
			return ret;
	}
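
One possible shape for that macro, as a sketch only (it also needs to advance
the data pointer, so that's taken as a parameter here, and next_segment() is
currently local to kvm_main.c):

#define kvm_for_each_chunk(gfn, data, offset, len, seg)			\
	for (seg = next_segment(len, offset);				\
	     seg;							\
	     gfn++, data += seg, len -= seg, offset = 0,		\
	     seg = next_segment(len, offset))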
	
I also think we should rename the "atomic" helpers to be "in_atomic" (or inatomic
if we want to follow the kernel's terrible nomenclature, which I always read as
"not atomic").  I always think read_guest_atomic() means an "atomic read".

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 17:17       ` Sean Christopherson
@ 2022-11-23 17:58         ` David Woodhouse
  2022-11-23 18:16           ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-23 17:58 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 4176 bytes --]

On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, David Woodhouse wrote:
> > Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> > thing I tried. There are a handful of complexifying factors, none of
> > which are insurmountable if we try hard enough.
> > 
> >  • Even if we fix the gfn_to_hva_cache to still use the fast path for
> >    more than the first page, it's still possible for the runstate to
> >    cross from one memslot to an adjacent one. We probably still need
> >    two of them, which is a large part of the ugliness of this patch.
> 
> Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
> space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
> break backwards compatibility since KVM doesn't even handle page splits, let alone
> memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
> contiguous anyways.
> 
> Probably a moot point since I don't see a way around the "page got swapped out"
> issue.

Right.

> >  • Accessing it via the userspace HVA requires coping with the case
> >    where the vCPU is being scheduled out, and it can't sleep. The
> >    previous code before commit 34d41d19e dealt with that by using 
> >    pagefault_disable() and depending on the GHC fast path. But even
> >    so...
> > 
> >  • We also could end up having to touch page B, page A then page B
> >    again, potentially having to abort and leave the runstate with
> >    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
> >    version which checks that both pages are present before it starts
> >    writing anything to the guest.
> 
> Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
> getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
> without a page split, the backing page could get swapped out between setting and
> clearing.

Indeed. Which is why I quite like the "lock the whole lot down and then
do the update" in my existing patch.

> > As I said, they're not insurmountable but that's why I ended up with
> > the patch you see, after first attempting to use a gfn_to_hva_cache
> > again. Happy to entertain suggestions.
> 
> What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
> and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
> avoid needing to acquire multiple gpc locks, and the update could use a single
> code flow by always constructing an on-stack representation.  E.g. very roughly:

Even in the 'split' case, I don't think we want to be doing the full
memslot lookup for kvm_vcpu_write_guest_inatomic() every time we
schedule the guest vCPU in and out. We do want the cache, and a
kvm_vcpu_write_guest_cached_inatomic() function.
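
Purely as a sketch of the shape I mean (this helper doesn't exist today, and
memslot generation checking is glossed over):

static int kvm_vcpu_write_guest_cached_inatomic(struct kvm_vcpu *vcpu,
						struct gfn_to_hva_cache *ghc,
						const void *data,
						unsigned long len)
{
	unsigned long left;

	if (!ghc->memslot || len > ghc->len)
		return -EFAULT;

	/* Don't sleep; fail if the page isn't present right now. */
	pagefault_disable();
	left = __copy_to_user((void __user *)ghc->hva, data, len);
	pagefault_enable();
	if (left)
		return -EFAULT;

	mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
	return 0;
}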

So instead of using two GPCs in the split case, that would mean we end
up using a gfn_to_pfn_cache for one page and a gfn_to_hva_cache for the
other?

And with or without that cache, we can *still* end up doing a partial
update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
might still be accessible, but bets are off about what state the rest
of the structure is in — and those runtimes are supposed to add up, or
the guest is going to get unhappy.

I'm actually OK with locking two GPCs. It wasn't my first choice, but
it's reasonable enough IMO given that none of the alternatives jump out
as being particularly attractive either.

> > I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> > 'write' version of same... but that doesn't seem to cope with crossing
> > memslots either; it also assumes its caller only uses it to access
> > within a single page.
> 
> We should fix the lack of page split handling.

Agreed.

Even if it starts with just KVM_BUG_ON(offset + len >= PAGE_SIZE) to
avoid surprises.

I'd *love* to make the GPC cope with page splits, and just give us a
virtually contiguous mapping of up to N pages for some reasonable value
of N. I just can't see how to do that in the IOMEM case.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 17:58         ` David Woodhouse
@ 2022-11-23 18:16           ` Sean Christopherson
  2022-11-23 18:48             ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-23 18:16 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> And with or without that cache, we can *still* end up doing a partial
> update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
> might still be accessible, but bets are off about what state the rest
> of the structure is in — and those runtimes are supposed to add up, or
> the guest is going to get unhappy.

Ugh.  What a terrible ABI.  

> I'm actually OK with locking two GPCs. It wasn't my first choice, but
> it's reasonable enough IMO given that none of the alternatives jump out
> as being particularly attractive either.

I detest the two GPCs, but since KVM apparently needs to provide "all or nothing"
updates, I don't see a better option.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
                     ` (2 preceding siblings ...)
  2022-11-22 23:46   ` Michal Luczaj
@ 2022-11-23 18:21   ` Sean Christopherson
  2022-11-23 19:22     ` David Woodhouse
  3 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-23 18:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Sat, Nov 19, 2022, David Woodhouse wrote:
> +		/*
> +		 * Use kvm_gpc_activate() here because if the runstate
> +		 * area was configured in 32-bit mode and only extends
> +		 * to the second page now because the guest changed to
> +		 * 64-bit mode, the second GPC won't have been set up.
> +		 */
> +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> +				     gpc1->gpa + user_len1, user_len2))

I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
write_lock_irqsave() for this to be safe.

Side topic, why do all of these flows disable IRQs?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 18:16           ` Sean Christopherson
@ 2022-11-23 18:48             ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-23 18:48 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 1671 bytes --]

On Wed, 2022-11-23 at 18:16 +0000, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, David Woodhouse wrote:
> > On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> > And with or without that cache, we can *still* end up doing a partial
> > update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
> > might still be accessible, but bets are off about what state the rest
> > of the structure is in — and those runtimes are supposed to add up, or
> > the guest is going to get unhappy.
> 
> Ugh.  What a terrible ABI.  
> 
> > I'm actually OK with locking two GPCs. It wasn't my first choice, but
> > it's reasonable enough IMO given that none of the alternatives jump out
> > as being particularly attractive either.
> 
> I detest the two GPCs, but since KVM apparently needs to provide "all or nothing"
> updates, I don't see a better option.

Actually I think it might not be so awful to do a contiguous virtual
(kernel) mapping of multiple discontiguous IOMEM PFNs. A kind of
memremapv().

We can open-code something like ioremap_prot(), doing a single call to
get_vm_area_caller() for the whole virtual size, then individually
calling ioremap_page_range() for each page.
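
Something like this, purely as a sketch of the idea (memremapv is a made-up
name and the error path is hand-waved):

static void *memremapv(kvm_pfn_t *pfns, unsigned int nr_pages, pgprot_t prot)
{
	struct vm_struct *area;
	unsigned long addr;
	unsigned int i;

	/* One virtual area covering the whole range... */
	area = get_vm_area_caller(nr_pages * PAGE_SIZE, VM_IOREMAP,
				  __builtin_return_address(0));
	if (!area)
		return NULL;

	/* ...then map each (possibly discontiguous) PFN into it individually. */
	addr = (unsigned long)area->addr;
	for (i = 0; i < nr_pages; i++) {
		if (ioremap_page_range(addr + i * PAGE_SIZE,
				       addr + (i + 1) * PAGE_SIZE,
				       (phys_addr_t)pfns[i] << PAGE_SHIFT,
				       prot)) {
			free_vm_area(area);	/* cleanup hand-waved */
			return NULL;
		}
	}
	return area->addr;
}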

So we *could* fix the GPC to cope with ranges which cross more than a
single page. But if the runstate area is the only user for that, as
seems to be the case so far, then it might be a bit too much additional
complexity in an area which is fun enough already.

I'll propose that we go ahead with the two-GPCs model for now, and if
we ever need to do that *again*, then we look harder at making the GPC
support multiple pages? 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 18:21   ` Sean Christopherson
@ 2022-11-23 19:22     ` David Woodhouse
  2022-11-23 19:32       ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-23 19:22 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 1993 bytes --]

On Wed, 2022-11-23 at 18:21 +0000, Sean Christopherson wrote:
> On Sat, Nov 19, 2022, David Woodhouse wrote:
...
		/* When invoked from kvm_sched_out() we cannot sleep */
		if (atomic)
			return;
...
> > +		/*
> > +		 * Use kvm_gpc_activate() here because if the runstate
> > +		 * area was configured in 32-bit mode and only extends
> > +		 * to the second page now because the guest changed to
> > +		 * 64-bit mode, the second GPC won't have been set up.
> > +		 */
> > +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> > +				     gpc1->gpa + user_len1, user_len2))
> 
> I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
> write_lock_irqsave() for this to be safe.

Hm, not sure I concur. You're only permitted to call kvm_gpc_activate()
in a context where you can sleep. Interrupts had better not be disabled
already, and it had better not need write_lock_irqsave().

In this particular context, we do drop all locks before calling
kvm_gpc_activate(), and we don't call it at all in the scheduling-out
case when we aren't permitted to sleep.

> Side topic, why do all of these flows disable IRQs?

The gpc->lock is permitted to be taken in IRQ context by the users of
the GPC. For example we do this for the shared_info and vcpu_info when
passing interrupts directly through to the guest via
kvm_arch_set_irq_inatomic() → kvm_xen_set_evtchn_fast().

The code path is fairly convoluted but I do believe we get there
directly from the VFIO IRQ handler, via the custom wakeup handler on
the irqfd's waitqueue (irqfd_wakeup).

Now, perhaps a GPC user which knows that *it* is not going to use its
GPC from IRQ context, could refrain from bothering to disable
interrupts when taking its own gpc->lock. And that's probably true for
the runstate area.

The generic GPC code still needs to disable interrupts when taking a
gpc->lock though, unless we want to add yet another flag to control
that behaviour.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 19:22     ` David Woodhouse
@ 2022-11-23 19:32       ` Sean Christopherson
  2022-11-23 20:07         ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2022-11-23 19:32 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 18:21 +0000, Sean Christopherson wrote:
> > On Sat, Nov 19, 2022, David Woodhouse wrote:
> ...
> 		/* When invoked from kvm_sched_out() we cannot sleep */
> 		if (atomic)
> 			return;
> ...
> > > +		/*
> > > +		 * Use kvm_gpc_activate() here because if the runstate
> > > +		 * area was configured in 32-bit mode and only extends
> > > +		 * to the second page now because the guest changed to
> > > +		 * 64-bit mode, the second GPC won't have been set up.
> > > +		 */
> > > +		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
> > > +				     gpc1->gpa + user_len1, user_len2))
> > 
> > I believe kvm_gpc_activate() needs to be converted from write_lock_irq() to
> > write_lock_irqsave() for this to be safe.
> 
> Hm, not sure I concur. You're only permitted to call kvm_gpc_activate()
> in a context where you can sleep. Interrupts had better not be disabled
> already, and it had better not need write_lock_irqsave().
> 
> In this particular context, we do drop all locks before calling
> kvm_gpc_activate(), and we don't call it at all in the scheduling-out
> case when we aren't permitted to sleep.

Oh, duh, there's an "if (atomic)" check right above this.

> > Side topic, why do all of these flows disable IRQs?
> 
> The gpc->lock is permitted to be taken in IRQ context by the users of
> the GPC. For example we do this for the shared_info and vcpu_info when
> passing interrupts directly through to the guest via
> kvm_arch_set_irq_inatomic() → kvm_xen_set_evtchn_fast().
> 
> The code path is fairly convoluted but I do believe we get there
> directly from the VFIO IRQ handler, via the custom wakeup handler on
> the irqfd's waitqueue (irqfd_wakeup).

Yeah, I remember finding that flow.

> Now, perhaps a GPC user which knows that *it* is not going to use its
> GPC from IRQ context, could refrain from bothering to disable
> interrupts when taking its own gpc->lock. And that's probably true for
> the runstate area.

This is effectively what I was asking about.

> The generic GPC code still needs to disable interrupts when taking a
> gpc->lock though, unless we want to add yet another flag to control
> that behaviour.

Right.  Might be worth adding a comment at some point to call out that disabling
IRQs may not be strictly required for all users, but it's done for simplicity.
Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
the behavior.

Thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 19:32       ` Sean Christopherson
@ 2022-11-23 20:07         ` David Woodhouse
  2022-11-23 20:26           ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-23 20:07 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 3196 bytes --]

On Wed, 2022-11-23 at 19:32 +0000, Sean Christopherson wrote:
> On Wed, Nov 23, 2022, David Woodhouse wrote:
> > Now, perhaps a GPC user which knows that *it* is not going to use its
> > GPC from IRQ context, could refrain from bothering to disable
> > interrupts when taking its own gpc->lock. And that's probably true for
> > the runstate area.
> 
> This is effectively what I was asking about.

So yes, I suppose we could do that. Incrementally...

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index af5ac495feec..5b93c3ed137c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -177,7 +177,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
 	int *rs_state = &rs.state;
-	unsigned long flags;
 	size_t times_ofs;
 	u8 *update_bit;
 
@@ -236,9 +235,9 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * Attempt to obtain the GPC lock on *both* (if there are two)
 	 * gfn_to_pfn caches that cover the region.
 	 */
-	read_lock_irqsave(&gpc1->lock, flags);
+	read_lock(&gpc1->lock);
 	while (!kvm_gpc_check(gpc1, user_len1)) {
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -247,7 +246,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		if (kvm_gpc_refresh(gpc1, user_len1))
 			return;
 
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 
 	/*
@@ -337,7 +336,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 	if (!kvm_gpc_check(gpc2, user_len2)) {
 		read_unlock(&gpc2->lock);
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -399,7 +398,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	if (user_len2)
 		read_unlock(&gpc2->lock);
  done_1:
-	read_unlock_irqrestore(&gpc1->lock, flags);
+	read_unlock(&gpc1->lock);
 
 	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
 	if (user_len2)

I don't know how I feel about that. I resonate with what you say below
about "not strictly required for all users, but it's done for
simplicity". This code doesn't have enough simplicity. Handling the
runstate GPC as a special case just because we can, doesn't necessarily
fill me with joy for the amount of optimisation that it brings.

> > The generic GPC code still needs to disable interrupts when taking a
> > gpc->lock though, unless we want to add yet another flag to control
> > that behaviour.
> 
> Right.  Might be worth adding a comment at some point to call out that disabling
> IRQs may not be strictly required for all users, but it's done for simplicity.
> Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
> the behavior.

Yeah. Or perhaps the kvm_gpc_lock() should go with that 'not required
for all users, but done for simplicity' angle too, and always disable
IRQs?


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area
  2022-11-23 20:07         ` David Woodhouse
@ 2022-11-23 20:26           ` Sean Christopherson
  0 siblings, 0 replies; 32+ messages in thread
From: Sean Christopherson @ 2022-11-23 20:26 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Paolo Bonzini, kvm, mhal

On Wed, Nov 23, 2022, David Woodhouse wrote:
> On Wed, 2022-11-23 at 19:32 +0000, Sean Christopherson wrote:
> > Right.  Might be worth adding a comment at some point to call out that disabling
> > IRQs may not be strictly required for all users, but it's done for simplicity.
> > Ah, if/when we add kvm_gpc_lock(), that would be the perfect place to document
> > the behavior.
> 
> Yeah. Or perhaps the kvm_gpc_lock() should go with that 'not required
> for all users, but done for simplicity' angle too, and always disable
> IRQs?

I was thinking the latter (always disable IRQs in kvm_gpc_lock()).  Sorry I didn't
make that clear.  I completely agree that fewer conditionals in this code is better,
I was mostly trying to figure out if there is some edge case I was missing.
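
Roughly what I have in mind for the helper, sketched with the check/refresh
signatures from this series (callers that can't sleep would still need their
own handling):

static int kvm_gpc_lock(struct gfn_to_pfn_cache *gpc, unsigned long len,
			unsigned long *flags)
{
	int ret;

	for (;;) {
		/* Disable IRQs unconditionally, for simplicity. */
		read_lock_irqsave(&gpc->lock, *flags);
		if (kvm_gpc_check(gpc, len))
			return 0;
		read_unlock_irqrestore(&gpc->lock, *flags);

		/* May sleep, so this can't be used from atomic context. */
		ret = kvm_gpc_refresh(gpc, len);
		if (ret)
			return ret;
	}
}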

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-22 17:02         ` David Woodhouse
  2022-11-22 19:09           ` Sean Christopherson
@ 2022-11-24  0:31           ` Paolo Bonzini
  2022-11-24  0:45             ` David Woodhouse
  2022-11-24  0:54             ` David Woodhouse
  1 sibling, 2 replies; 32+ messages in thread
From: Paolo Bonzini @ 2022-11-24  0:31 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

On Tue, Nov 22, 2022 at 6:25 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > Tags need your real name, not just your email address, i.e. this should be:
> > >     Reviewed-by: Paul Durrant <paul@xen.org>
> >
> > Yes indeed it should. Don't know how I managed to screw that up... It's
> > not like haven't type that properly hundreds of times on Xen patch reviews.
>
> All sorted in the tree I'm curating
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
>
> Of those, three are marked as Cc:stable and want to go into the 6.1 release:
>
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>
> The rest (including the runstate compatibility fixes) are less
> critical.

I have picked them into both kvm/master and kvm/queue.

The gpc series probably will be left for 6.3. I had already removed Sean's
bits for the gpc and will rebase on top of your runstate compatibility fixes,
which I'm cherry-picking into kvm/queue.

But wow, is that runstate compatibility patch ugly.  Is it really worth it
having the two separate update paths, one which is ugly because of
BUILD_BUG_ON assertions and one which is ugly because of the
two-page stuff?

Like this (sorry about any word-wrapping, I'll repost it properly
after testing):

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b246decb53a9..873a0ded3822 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -205,7 +205,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 #ifdef CONFIG_X86_64
         /*
          * Don't leak kernel memory through the padding in the 64-bit
-         * struct if we take the split-page path.
+         * struct.
          */
         memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
 #endif
@@ -251,83 +251,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         read_lock_irqsave(&gpc1->lock, flags);
     }

-    /*
-     * The common case is that it all fits on a page and we can
-     * just do it the simple way.
-     */
-    if (likely(!user_len2)) {
-        /*
-         * We use 'int *user_state' to point to the state field, and
-         * 'u64 *user_times' for runstate_entry_time. So the actual
-         * array of time[] in each state starts at user_times[1].
-         */
-        int *user_state = gpc1->khva;
-        u64 *user_times = gpc1->khva + times_ofs;
-
-        /*
-         * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
-         * field. We need to set it (and write-barrier) before writing to the
-         * the rest of the structure, and clear it last. Just as Xen does, we
-         * address the single *byte* in which it resides because it might be
-         * in a different cache line to the rest of the 64-bit word, due to
-         * the (lack of) alignment constraints.
-         */
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
-
-        update_bit = ((u8 *)(&user_times[1])) - 1;
-        *update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
-        smp_wmb();
-
-        /*
-         * Next, write the new runstate. This is in the *same* place
-         * for 32-bit and 64-bit guests, asserted here for paranoia.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-                 offsetof(struct compat_vcpu_runstate_info, state));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        *user_state = vx->current_runstate;
-
-        /*
-         * Then the actual runstate_entry_time (with the UPDATE bit
-         * still set).
-         */
-        *user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-
-        /*
-         * Write the actual runstate times immediately after the
-         * runstate_entry_time.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-                 offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-        BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-                 offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof_field(struct compat_vcpu_runstate_info, time));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof(vx->runstate_times));
-        memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
-
-        smp_wmb();
-
-        /*
-         * Finally, clear the 'updating' bit. Don't use &= here because
-         * the compiler may not realise that update_bit and user_times
-         * point to the same place. That's a classic pointer-aliasing
-         * problem.
-         */
-        *update_bit = vx->runstate_entry_time >> 56;
-        smp_wmb();
-
-        goto done_1;
-    }
-
     /*
      * The painful code path. It's split across two pages and we need to
      * hold and validate both GPCs simultaneously. Thankfully we can get
@@ -336,7 +259,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
      */
     read_lock(&gpc2->lock);

-    if (!kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+    if (user_len2 && !kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
         read_unlock(&gpc2->lock);
         read_unlock_irqrestore(&gpc1->lock, flags);

@@ -361,6 +284,20 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         goto retry;
     }

+    /*
+     * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
+     * field. We need to set it (and write-barrier) before writing to the
+     * the rest of the structure, and clear it last. Just as Xen does, we
+     * address the single *byte* in which it resides because it might be
+     * in a different cache line to the rest of the 64-bit word, due to
+     * the (lack of) alignment constraints.
+     */
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+
     /*
      * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
      */
@@ -370,6 +307,17 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
         update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
             user_len1;

+    /*
+     * Next, write the new runstate. This is in the *same* place
+     * for 32-bit and 64-bit guests, asserted here for paranoia.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+             offsetof(struct compat_vcpu_runstate_info, state));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+
     /*
      * Create a structure on our stack with everything in the right place.
      * The rs_state pointer points to the start of it, which in the case
@@ -378,29 +326,44 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
      */
     *rs_state = vx->current_runstate;
     rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
-
     *update_bit = rs.state_entry_time >> 56;
     smp_wmb();

+    /*
+     * Write the actual runstate times immediately after the
+     * runstate_entry_time.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+             offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+             offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof_field(struct compat_vcpu_runstate_info, time));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof(vx->runstate_times));
+    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
+
     /*
      * Having constructed the structure starting at *rs_state, copy it
      * into the first and second pages as appropriate using user_len1
      * and user_len2.
      */
     memcpy(gpc1->khva, rs_state, user_len1);
-    memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+    if (user_len2)
+        memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
     smp_wmb();

     /*
-     * Finally, clear the XEN_RUNSTATE_UPDATE bit.
+     * Finally, clear the 'updating' bit. Don't use &= here because
+     * the compiler may not realise that update_bit and user_times
+     * point to the same place. That's a classic pointer-aliasing
+     * problem.
      */
     *update_bit = vx->runstate_entry_time >> 56;
     smp_wmb();

     if (user_len2)
         read_unlock(&gpc2->lock);
- done_1:
     read_unlock_irqrestore(&gpc1->lock, flags);

     mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);


? Yet another possibility is to introduce a

/* Copy from src to dest_ofs bytes into the combined area pointed to by
 * dest1 (up to dest1_len bytes) and dest2 (the rest). */
void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
dest1_len, void *src, size_t src_len)

so that the on-stack struct is not needed at all. This makes it possible to
avoid the rs_state hack as well.
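
A minimal sketch of what that helper could look like, reading dest_ofs as the
offset into the combined destination area (untested):

static void split_memcpy(void *dest1, void *dest2, size_t dest_ofs,
			 size_t dest1_len, void *src, size_t src_len)
{
	size_t len1 = 0;

	if (dest_ofs < dest1_len) {
		/* Some (or all) of the copy lands in the first region. */
		len1 = min(src_len, dest1_len - dest_ofs);
		memcpy(dest1 + dest_ofs, src, len1);
	}
	if (len1 < src_len)
		/* The rest continues at the matching offset in the second region. */
		memcpy(dest2 + dest_ofs + len1 - dest1_len,
		       src + len1, src_len - len1);
}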

It's in kvm/queue only, so there's time to include a new version.

Paolo

> Sean, given that this now includes your patch series which in turn you
> took over from Michal, how would you prefer me to proceed?
>
> David Woodhouse (7):
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: x86/xen: Add CPL to Xen hypercall tracepoint
>       MAINTAINERS: Add KVM x86/xen maintainer list
>       KVM: x86/xen: Compatibility fixes for shared runstate area
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>       KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
>
> Michal Luczaj (6):
>       KVM: Shorten gfn_to_pfn_cache function names
>       KVM: x86: Remove unused argument in gpc_unmap_khva()
>       KVM: Store immutable gfn_to_pfn_cache properties
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
>       KVM: Clean up hva_to_pfn_retry()
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()
>
> Sean Christopherson (4):
>       KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
>       KVM: Do not partially reinitialize gfn=>pfn cache during activation
>       KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
>       KVM: Skip unnecessary "unmap" if gpc is already valid during refresh
>
> We can reinstate the 'immutable length' thing on top, if we pick one of
> the discussed options for coping with the fact that for the runstate
> area, it *isn't* immutable. I'm slightly leaning towards just setting
> the length to '1' despite it being a lie.


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-24  0:31           ` Paolo Bonzini
@ 2022-11-24  0:45             ` David Woodhouse
  2022-11-24  0:54             ` David Woodhouse
  1 sibling, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-24  0:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 1250 bytes --]

On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> I have picked them into both kvm/master and kvm/queue.

Thanks.

> The gpc series probably will be left for 6.3. I had already removed
> Sean's bits for the gpc and will rebase on top of your runstate
> compatibility fixes, which I'm cherry-picking into kvm/queue.
> 
> But wow, is that runstate compatibility patch ugly.  Is it really
> worth it having the two separate update paths, one which is ugly
> because of BUILD_BUG_ON assertions and one which is ugly because of
> the two-page stuff?

The BUILD_BUG_ON() assertions could move. I quite liked having them
there as documentation for compat handling, but I'm happy to move them.
There's plenty of *actual* comments on the compat handling too ;)

I do think it's worth the separate paths because the *common* case
doesn't have to bounce it through the kernel stack at all, and it's a
relatively fast path because it happens on each schedule in/out.

The fast path isn't the part I hate. Removing *that* doesn't really
make things much nicer.

If you want to move the BUILD_BUG_ONs into the slow path and then we
can concentrate our hate on that part and leave the fast path even
nicer, that's fine though ;)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-24  0:31           ` Paolo Bonzini
  2022-11-24  0:45             ` David Woodhouse
@ 2022-11-24  0:54             ` David Woodhouse
  2022-11-24  1:11               ` Paolo Bonzini
  1 sibling, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-24  0:54 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 616 bytes --]

On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> Yet another possibility is to introduce a
> 
> /* Copy from src to dest_ofs bytes into the combined area pointed to by
>  * dest1 (up to dest1_len bytes) and dest2 (the rest). */
> void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
> dest1_len, void *src, size_t src_len)
> 
> so that the on-stack struct is not needed at all. This makes it possible to
> avoid the rs_state hack as well.

FWIW I did one of those in 
https://lore.kernel.org/kvm/6acfdbd3e516ece5949e97c85e7db8dc97a3e6c6.camel@infradead.org/

Hated that too :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-24  0:54             ` David Woodhouse
@ 2022-11-24  1:11               ` Paolo Bonzini
  2022-11-24 10:41                 ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2022-11-24  1:11 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

On Thu, Nov 24, 2022 at 1:55 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Thu, 2022-11-24 at 01:31 +0100, Paolo Bonzini wrote:
> > Yet another possibility is to introduce a
> >
> > /* Copy src_len bytes from src into the combined area pointed to by
> >  * dest1 (up to dest1_len bytes) and dest2 (the rest), at offset dest_ofs. */
> > void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
> > dest1_len, void *src, size_t src_len)
> >
> > so that the on-stack struct is not needed at all. This makes it possible to
> > avoid the rs_state hack as well.
>
> FWIW I did one of those in
> https://lore.kernel.org/kvm/6acfdbd3e516ece5949e97c85e7db8dc97a3e6c6.camel@infradead.org/
>
> Hated that too :)

Aha, that may be where I got the idea. Won't win a beauty contest indeed.

What if vx contained a struct vcpu_runstate_info instead of having
separate fields? That might combine the best of the on-stack version
and the best of the special-memcpy version...

Going to bed now anyway.

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-24  1:11               ` Paolo Bonzini
@ 2022-11-24 10:41                 ` David Woodhouse
  2022-11-24 11:30                   ` David Woodhouse
  0 siblings, 1 reply; 32+ messages in thread
From: David Woodhouse @ 2022-11-24 10:41 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 15276 bytes --]

On Thu, 2022-11-24 at 02:11 +0100, Paolo Bonzini wrote:
> 
> What if vx contained a struct vcpu_runstate_info instead of having
> separate fields? That might combine the best of the on-stack version
> and the best of the special-memcpy version...

Well... the state field is in a different place for compat vs. 64-bit
which makes that fun, and the state_entry_time field needs to have the
XEN_RUNSTATE_UPDATE bit masked into it, which means we don't really want
to just memcpy *that* part... and we already *are* just doing a memcpy
for the rest, which is the time[] array.
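
For reference, the two guest layouts being juggled here are roughly as
follows (a sketch only; the real definitions live in the Xen public
headers and KVM's compat handling, and the offsets are the ones the
BUILD_BUG_ON()s in the patch below assert):

struct vcpu_runstate_info {             /* 64-bit guest: 0x30 bytes */
        int state;                      /* 0x00 */
                                        /* 0x04: 4 bytes of padding */
        uint64_t state_entry_time;      /* 0x08 */
        uint64_t time[4];               /* 0x10 */
};

struct compat_vcpu_runstate_info {      /* 32-bit guest: 0x2c bytes */
        int state;                      /* 0x00 */
        uint64_t state_entry_time;      /* 0x04, only 32-bit aligned */
        uint64_t time[4];               /* 0x0c */
} __attribute__((packed));              /* packed, matching the 32-bit ABI */

So the two agree on the state field at offset 0, but everything from
state_entry_time onwards is shifted by 4, and the byte containing the
XEN_RUNSTATE_UPDATE bit moves with it.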

But we *can* recognise that the fast and slow paths aren't all that
different really, and combine them. How's this?

From bf9d04e1c59771c82d5694d22a86fb0508be1530 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Date: Thu, 24 Nov 2022 09:49:00 +0000
Subject: [PATCH] KVM: x86/xen: Reconcile fast and slow paths for runstate
 update

Instead of having a completely separate fast path for the case where the
guest's runstate_info fits entirely in a page, recognise the similarity.

In both cases, the @update_bit pointer points to the byte containing the
XEN_RUNSTATE_UPDATE flag directly in the guest, via one of the GPCs.

In both cases, the actual guest structure (compat or not) is built up
from the fields in @vx, following a preset pointer. The only difference
is whether that pointer points to the kernel stack (in the split case)
or to guest memory directly via the GPC.

All that can be the same if we just set the pointers up accordingly for
each case. Then the only real difference is that dual memcpy, and the
whole of that original fast path can go away, disappearing into the slow
path without actually doing the slow part.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 249 +++++++++++++++++++++++----------------------
 1 file changed, 126 insertions(+), 123 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index d4077262c0a2..eaa7fddfbb8c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -177,44 +177,78 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
-	int *rs_state = &rs.state;
 	unsigned long flags;
 	size_t times_ofs;
-	u8 *update_bit;
+	uint8_t *update_bit;
+	uint64_t *rs_times;
+	int *rs_state;
 
 	/*
 	 * The only difference between 32-bit and 64-bit versions of the
 	 * runstate struct is the alignment of uint64_t in 32-bit, which
 	 * means that the 64-bit version has an additional 4 bytes of
-	 * padding after the first field 'state'.
+	 * padding after the first field 'state'. Let's be really really
+	 * paranoid about that, and matching it with our internal data
+	 * structures that we memcpy into it...
 	 */
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state) != 0);
 	BUILD_BUG_ON(sizeof(struct compat_vcpu_runstate_info) != 0x2c);
 #ifdef CONFIG_X86_64
+	/*
+	 * The 64-bit structure has 4 bytes of padding before 'state_entry_time'
+	 * so each subsequent field is shifted by 4, and it's 4 bytes longer.
+	 */
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
 		     offsetof(struct compat_vcpu_runstate_info, state_entry_time) + 4);
 	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, time) !=
 		     offsetof(struct compat_vcpu_runstate_info, time) + 4);
+	BUILD_BUG_ON(sizeof(struct vcpu_runstate_info) != 0x2c + 4);
 #endif
+	/*
+	 * The state field is in the same place at the start of both structs,
+	 * and is the same size (int) as vx->current_runstate.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+		     offsetof(struct compat_vcpu_runstate_info, state));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+		     sizeof(vx->current_runstate));
+	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+		     sizeof(vx->current_runstate));
+
+	/*
+	 * The state_entry_time field is 64 bits in both versions, and the
+	 * XEN_RUNSTATE_UPDATE flag is in the top bit, which given that x86
+	 * is little-endian means that it's in the last *byte* of the word.
+	 * That detail is important later.
+	 */
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+		     sizeof(uint64_t));
+	BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
+		     sizeof(uint64_t));
+	BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+
+	/*
+	 * The time array is four 64-bit quantities in both versions, matching
+	 * the vx->runstate_times and immediately following state_entry_time.
+	 */
+	BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+	BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
+		     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+		     sizeof_field(struct compat_vcpu_runstate_info, time));
+	BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+		     sizeof(vx->runstate_times));
 
 	if (IS_ENABLED(CONFIG_64BIT) && v->kvm->arch.xen.long_mode) {
 		user_len = sizeof(struct vcpu_runstate_info);
 		times_ofs = offsetof(struct vcpu_runstate_info,
 				     state_entry_time);
-#ifdef CONFIG_X86_64
-		/*
-		 * Don't leak kernel memory through the padding in the 64-bit
-		 * struct if we take the split-page path.
-		 */
-		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
-#endif
 	} else {
 		user_len = sizeof(struct compat_vcpu_runstate_info);
 		times_ofs = offsetof(struct compat_vcpu_runstate_info,
 				     state_entry_time);
-		/* Start of struct for compat mode (qv). */
-		rs_state++;
 	}
 
 	/*
@@ -251,145 +285,114 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		read_lock_irqsave(&gpc1->lock, flags);
 	}
 
-	/*
-	 * The common case is that it all fits on a page and we can
-	 * just do it the simple way.
-	 */
 	if (likely(!user_len2)) {
 		/*
-		 * We use 'int *user_state' to point to the state field, and
-		 * 'u64 *user_times' for runstate_entry_time. So the actual
-		 * array of time[] in each state starts at user_times[1].
+		 * Set up three pointers directly to the runstate_info
+		 * struct in the guest (via the GPC).
+		 *
+		 *  • @rs_state   → state field
+		 *  • @rs_times   → state_entry_time field.
+		 *  • @update_bit → last byte of state_entry_time, which
+		 *                  contains the XEN_RUNSTATE_UPDATE bit.
 		 */
-		int *user_state = gpc1->khva;
-		u64 *user_times = gpc1->khva + times_ofs;
-
+		rs_state = gpc1->khva;
+		rs_times = gpc1->khva + times_ofs;
+		update_bit = ((uint8_t *)(&rs_times[1])) - 1;
+	} else {
 		/*
-		 * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
-		 * field. We need to set it (and write-barrier) before writing to the
-		 * the rest of the structure, and clear it last. Just as Xen does, we
-		 * address the single *byte* in which it resides because it might be
-		 * in a different cache line to the rest of the 64-bit word, due to
-		 * the (lack of) alignment constraints.
+		 * The guest's runstate_info is split across two pages and we
+		 * need to hold and validate both GPCs simultaneously. We can
+		 * declare a lock ordering GPC1 > GPC2 because nothing else
+		 * takes them more than one at a time.
 		 */
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
-			     sizeof(uint64_t));
-		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state_entry_time) !=
-			     sizeof(uint64_t));
-		BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+		read_lock(&gpc2->lock);
 
-		update_bit = ((u8 *)(&user_times[1])) - 1;
-		*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
-		smp_wmb();
+		if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+			read_unlock(&gpc2->lock);
+			read_unlock_irqrestore(&gpc1->lock, flags);
 
-		/*
-		 * Next, write the new runstate. This is in the *same* place
-		 * for 32-bit and 64-bit guests, asserted here for paranoia.
-		 */
-		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-			     offsetof(struct compat_vcpu_runstate_info, state));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-			     sizeof(vx->current_runstate));
-		BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-			     sizeof(vx->current_runstate));
-		*user_state = vx->current_runstate;
+			/* When invoked from kvm_sched_out() we cannot sleep */
+			if (atomic)
+				return;
 
-		/*
-		 * Then the actual runstate_entry_time (with the UPDATE bit
-		 * still set).
-		 */
-		*user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+			/*
+			 * Use kvm_gpc_activate() here because if the runstate
+			 * area was configured in 32-bit mode and only extends
+			 * to the second page now because the guest changed to
+			 * 64-bit mode, the second GPC won't have been set up.
+			 */
+			if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
+					     gpc1->gpa + user_len1, user_len2))
+				return;
+
+			/*
+			 * We dropped the lock on GPC1 so we have to go all the
+			 * way back and revalidate that too.
+			 */
+			goto retry;
+		}
 
 		/*
-		 * Write the actual runstate times immediately after the
-		 * runstate_entry_time.
+		 * In this case, the runstate_info struct will be assembled on
+		 * the kernel stack (compat or not as appropriate) and will
+		 * be copied to GPC1/GPC2 with a dual memcpy. Set up the three
+		 * rs pointers accordingly.
 		 */
-		BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-			     offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-		BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info, state_entry_time) !=
-			     offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-			     sizeof_field(struct compat_vcpu_runstate_info, time));
-		BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-			     sizeof(vx->runstate_times));
-		memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
-
-		smp_wmb();
+		rs_times = &rs.state_entry_time;
 
 		/*
-		 * Finally, clear the 'updating' bit. Don't use &= here because
-		 * the compiler may not realise that update_bit and user_times
-		 * point to the same place. That's a classic pointer-aliasing
-		 * problem.
+		 * The rs_state pointer points to the start of what we'll
+		 * copy to the guest, which in the case of a compat guest
+		 * is the 32-bit field that the compiler thinks is padding.
 		 */
-		*update_bit = vx->runstate_entry_time >> 56;
-		smp_wmb();
-
-		goto done_1;
-	}
-
-	/*
-	 * The painful code path. It's split across two pages and we need to
-	 * hold and validate both GPCs simultaneously. Thankfully we can get
-	 * away with declaring a lock ordering GPC1 > GPC2 because nothing
-	 * else takes them more than one at a time.
-	 */
-	read_lock(&gpc2->lock);
-
-	if (!kvm_gfn_to_pfn_cache_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
-		read_unlock(&gpc2->lock);
-		read_unlock_irqrestore(&gpc1->lock, flags);
-
-		/* When invoked from kvm_sched_out() we cannot sleep */
-		if (atomic)
-			return;
+		rs_state = ((void *)rs_times) - times_ofs;
 
 		/*
-		 * Use kvm_gpc_activate() here because if the runstate
-		 * area was configured in 32-bit mode and only extends
-		 * to the second page now because the guest changed to
-		 * 64-bit mode, the second GPC won't have been set up.
+		 * The update_bit is still directly in the guest memory,
+		 * via one GPC or the other.
 		 */
-		if (kvm_gpc_activate(v->kvm, gpc2, NULL, KVM_HOST_USES_PFN,
-				     gpc1->gpa + user_len1, user_len2))
-			return;
+		if (user_len1 >= times_ofs + sizeof(uint64_t))
+			update_bit = ((u8 *)gpc1->khva) + times_ofs +
+				sizeof(uint64_t) - 1;
+		else
+			update_bit = ((u8 *)gpc2->khva) + times_ofs +
+				sizeof(uint64_t) - 1 - user_len1;
 
+#ifdef CONFIG_X86_64
 		/*
-		 * We dropped the lock on GPC1 so we have to go all the way
-		 * back and revalidate that too.
+		 * Don't leak kernel memory through the padding in the 64-bit
+		 * version of the struct.
 		 */
-		goto retry;
+		memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
+#endif
 	}
 
 	/*
-	 * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
+	 * First, set the XEN_RUNSTATE_UPDATE bit in the top bit of the
+	 * state_entry_time field, directly in the guest. We need to set
+	 * that (and write-barrier) before writing to the rest of the
+	 * structure, and clear it last. Just as Xen does, we address the
+	 * single *byte* in which it resides because it might be in a
+	 * different cache line to the rest of the 64-bit word, due to
+	 * the (lack of) alignment constraints.
 	 */
-	if (user_len1 >= times_ofs + sizeof(uint64_t))
-		update_bit = ((u8 *)gpc1->khva) + times_ofs + sizeof(u64) - 1;
-	else
-		update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
-			user_len1;
+	*update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
+	smp_wmb();
 
 	/*
-	 * Create a structure on our stack with everything in the right place.
-	 * The rs_state pointer points to the start of it, which in the case
-	 * of a compat guest on a 64-bit host is the 32 bit field that the
-	 * compiler thinks is padding.
+	 * Now assemble the actual structure, either on our kernel stack
+	 * or directly in the guest according to how the rs_state and
+	 * rs_times pointers were set up above.
 	 */
 	*rs_state = vx->current_runstate;
-	rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-	memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
-
-	*update_bit = rs.state_entry_time >> 56;
-	smp_wmb();
+	rs_times[0] = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
+	memcpy(rs_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
 
-	/*
-	 * Having constructed the structure starting at *rs_state, copy it
-	 * into the first and second pages as appropriate using user_len1
-	 * and user_len2.
-	 */
-	memcpy(gpc1->khva, rs_state, user_len1);
-	memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	/* For the split case, we have to then copy it to the guest. */
+	if (user_len2) {
+		memcpy(gpc1->khva, rs_state, user_len1);
+		memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+	}
 	smp_wmb();
 
 	/*
@@ -400,7 +403,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 	if (user_len2)
 		read_unlock(&gpc2->lock);
- done_1:
+
 	read_unlock_irqrestore(&gpc1->lock, flags);
 
 	mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);
-- 
2.25.1



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page
  2022-11-24 10:41                 ` David Woodhouse
@ 2022-11-24 11:30                   ` David Woodhouse
  0 siblings, 0 replies; 32+ messages in thread
From: David Woodhouse @ 2022-11-24 11:30 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Paul Durrant, Sean Christopherson, Durrant, Paul, kvm, mhal

[-- Attachment #1: Type: text/plain, Size: 1226 bytes --]

On Thu, 2022-11-24 at 10:41 +0000, David Woodhouse wrote:
> On Thu, 2022-11-24 at 02:11 +0100, Paolo Bonzini wrote:
> > 
> > What if vx contained a struct vcpu_runstate_info instead of having
> > separate fields? That might combine the best of the on-stack version
> > and the best of the special-memcpy version...
> 
> Well... the state field is in a different place for compat vs. 64-bit
> which makes that fun, and the state_entry_time field need to have the
> XEN_RUNSTATE_UPDATE bit masked into it which means we don't really want
> to just memcpy *that* part... and we already *are* just doing a memcpy
> for the rest, which is the time[] array.
> 
> But we *can* recognise that the fast and slow paths aren't all that
> different really, and combine them. How's this?

That version applies immediately after the existing 'Compatibility
fixes for shared runstate area' commit, in case you wanted to squash
them.

Alternatively if you prefer it that way, I made a few trivial cosmetic
cleanups and rebased it on top of the existing kvm/queue. I also
rebased the remaining GPC cleanups on top of that, in my tree at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2022-11-24 11:31 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-19  9:46 [PATCH 1/4] MAINTAINERS: Add KVM x86/xen maintainer list David Woodhouse
2022-11-19  9:46 ` [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area David Woodhouse
2022-11-19 11:38   ` Durrant, Paul
2022-11-19 11:46     ` David Woodhouse
2022-11-22 18:39   ` Sean Christopherson
2022-11-22 23:04     ` David Woodhouse
2022-11-23 17:17       ` Sean Christopherson
2022-11-23 17:58         ` David Woodhouse
2022-11-23 18:16           ` Sean Christopherson
2022-11-23 18:48             ` David Woodhouse
2022-11-22 23:46   ` Michal Luczaj
2022-11-23  0:03     ` David Woodhouse
2022-11-23 18:21   ` Sean Christopherson
2022-11-23 19:22     ` David Woodhouse
2022-11-23 19:32       ` Sean Christopherson
2022-11-23 20:07         ` David Woodhouse
2022-11-23 20:26           ` Sean Christopherson
2022-11-19  9:46 ` [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page David Woodhouse
2022-11-19 11:40   ` Durrant, Paul
2022-11-22 16:23     ` Sean Christopherson
2022-11-22 16:49       ` Paul Durrant
2022-11-22 17:02         ` David Woodhouse
2022-11-22 19:09           ` Sean Christopherson
2022-11-22 23:13             ` David Woodhouse
2022-11-24  0:31           ` Paolo Bonzini
2022-11-24  0:45             ` David Woodhouse
2022-11-24  0:54             ` David Woodhouse
2022-11-24  1:11               ` Paolo Bonzini
2022-11-24 10:41                 ` David Woodhouse
2022-11-24 11:30                   ` David Woodhouse
2022-11-19  9:46 ` [PATCH 4/4] KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary David Woodhouse
2022-11-19 11:42   ` Durrant, Paul
