From: Andrea Arcangeli <aarcange@redhat.com>
To: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
	rkrcmar@redhat.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check
Date: Mon, 19 Dec 2016 12:42:41 +0100	[thread overview]
Message-ID: <20161219114241.GD4927@redhat.com> (raw)
In-Reply-To: <1478077718-37424-9-git-send-email-xinhui.pan@linux.vnet.ibm.com>

Hello,

On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
> Support the vcpu_is_preempted() functionality under KVM. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
> 
> Use one field of struct kvm_steal_time, ::preempted, to indicate
> whether a vcpu is running or not.
> 
> Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
>  arch/x86/kvm/x86.c                   | 16 ++++++++++++++++
>  2 files changed, 19 insertions(+), 1 deletion(-)
> 
[..]
> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
> +{
> +	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> +		return;
> +
> +	vcpu->arch.st.steal.preempted = 1;
> +
> +	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
> +			&vcpu->arch.st.steal.preempted,
> +			offsetof(struct kvm_steal_time, preempted),
> +			sizeof(vcpu->arch.st.steal.preempted));
> +}
> +
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	kvm_steal_time_set_preempted(vcpu);
>  	kvm_x86_ops->vcpu_put(vcpu);
>  	kvm_put_guest_fpu(vcpu);
>  	vcpu->arch.last_host_tsc = rdtsc();

You can't call kvm_steal_time_set_preempted() in atomic context
(neither in the sched_out notifier nor in vcpu_put() after
preempt_disable()): __copy_to_user() in kvm_write_guest_offset_cached()
can schedule, which locks up the host.
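
For clarity, a minimal sketch (mine, not from any patch here) of what
pagefault_disable() buys in atomic context: with page faults disabled,
__copy_to_user() can no longer sleep in the fault handler; if the
target page isn't resident, the copy fails fast and reports the
uncopied bytes (dst/src/len are placeholders):

	unsigned long left;

	pagefault_disable();
	/* cannot schedule now: a non-resident page fails the copy */
	left = __copy_to_user(dst, src, len);
	pagefault_enable();
	if (left)
		/* write skipped, the data must only ever be a hint */
		pr_debug("steal time preempted update skipped\n");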

kvm->srcu (or kvm->slots_lock) is also not taken, and
kvm_write_guest_offset_cached() needs to call kvm_memslots(), which
requires one of them to be held.
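
The read-side pattern required around anything that ends up calling
kvm_memslots() looks like this (the second patch below adds exactly
this):

	int idx;

	idx = srcu_read_lock(&vcpu->kvm->srcu);
	/* kvm_memslots() is only safe inside the srcu read side */
	kvm_steal_time_set_preempted(vcpu);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);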

This, I think, is why postcopy live migration locks up with current
upstream. It doesn't seem related to userfaultfd at all (initially I
suspected the vmf conversion, but it wasn't that), and in theory it
can happen with heavy swapping or page migration too.

It's just that the page is written so frequently that it's unlikely to
be swapped out. The page being written so frequently also means it's
very likely to be found re-dirtied when postcopy starts, which pretty
much guarantees a userfault will trigger a scheduling event in
kvm_steal_time_set_preempted() on the destination. Hence the opposite
odds of reproducing this with swapping vs. postcopy live migration.

For now I applied the two patches below, but they only skip the write
and prevent the host instability, since nobody checks the return value
of __copy_to_user() (what happens to the guest after the write is
skipped is less clear and should be investigated, but at least the
host will survive, and not all guests care about this flag being
updated). For this to be fully safe, the preempted information should
be just a hint and not fundamental to the correctness of the guest pv
spinlock code.
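
For illustration, the guest side of this series (patch 09/11) consumes
the flag roughly as below (my paraphrase, not the exact code): since
the host-side update can now be skipped, this has to stay a
best-effort heuristic for breaking out of busy-waits early, never
something correctness depends on:

	static bool kvm_vcpu_is_preempted(int cpu)
	{
		struct kvm_steal_time *src = &per_cpu(steal_time, cpu);

		/* a stale value here only costs some extra spinning */
		return !!src->preempted;
	}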

This bug was introduced in commit
0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.

From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 17 Dec 2016 18:43:52 +0100
Subject: [PATCH 1/2] kvm: fix schedule in atomic in
 kvm_steal_time_set_preempted()

kvm_steal_time_set_preempted() isn't disabling page faults before
calling __copy_to_user(), and the kernel debug checks notice.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/kvm/x86.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f0d238..2dabaeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * Disable page faults because we're in atomic context here.
+	 * kvm_write_guest_offset_cached() would call might_fault()
+	 * that relies on pagefault_disable() to tell if there's a
+	 * bug. NOTE: the write to guest memory may not go through
+	 * during postcopy live migration or if there's heavy guest
+	 * paging.
+	 */
+	pagefault_disable();
 	kvm_steal_time_set_preempted(vcpu);
+	pagefault_enable();
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();


From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <aarcange@redhat.com>
Date: Sat, 17 Dec 2016 19:13:32 +0100
Subject: [PATCH 2/2] kvm: take srcu lock around kvm_steal_time_set_preempted()

kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/kvm/x86.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2dabaeb..02e6ab4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,6 +2844,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	int idx;
 	/*
 	 * Disable page faults because we're in atomic context here.
 	 * kvm_write_guest_offset_cached() would call might_fault()
@@ -2853,7 +2854,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	 * paging.
 	 */
 	pagefault_disable();
+	/*
+	 * kvm_memslots() will be called by
+	 * kvm_write_guest_offset_cached() so take the srcu lock.
+	 */
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
 	kvm_steal_time_set_preempted(vcpu);
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	pagefault_enable();
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
