All of lore.kernel.org
 help / color / mirror / Atom feed
* Q. about KVM and CPU hotplug
@ 2021-11-30  8:27 Tian, Kevin
  2021-11-30  9:28 ` Paolo Bonzini
  0 siblings, 1 reply; 9+ messages in thread
From: Tian, Kevin @ 2021-11-30  8:27 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao

Hi, Paolo/Thomas,

I'm curious about the consequence if KVM fails to initialize a 
hotplugged CPU.

Looking at the code KVM has been added to the CPU hotplug state 
machine:

	r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
		kvm_starting_cpu, kvm_dying_cpu);

	static int kvm_starting_cpu(unsigned int cpu)
	{
		raw_spin_lock(&kvm_count_lock);
		if (kvm_usage_count)
			hardware_enable_nolock(NULL);
		raw_spin_unlock(&kvm_count_lock);
		return 0;
	}

kvm_starting_cpu() always return success as the callbacks in the 
STARTING section are not allowed to fail.

However hardware_enable_nolock() may fail for various reasons:

	static void hardware_enable_nolock(void *junk)
	{
		int cpu = raw_smp_processor_id();
		int r;

		if (cpumask_test_cpu(cpu, cpus_hardware_enabled))
			return;

		cpumask_set_cpu(cpu, cpus_hardware_enabled);

		r = kvm_arch_hardware_enable();

		if (r) {
			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
			atomic_inc(&hardware_enable_failed);
			pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
		}
	}

Upon error hardware_enable_failed is incremented. However this variable
is checked only in hardware_enable_all() called when the 1st VM is called.

This implies that KVM may be left in a state where it doesn't know a CPU 
not ready to host VMX operations.

Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error 
to Qemu at some point or may it lead to undefined behavior? And is there
any method to prevent vCPU thread from being scheduled to the CPU?

We found this open when considering TDX and CPU hotplug. 

By design the current generation of TDX doesn't support CPU hotplug. 
Only boot-time CPUs can be initialized for TDX (and must be done en 
masse in one breath). Attempting to do seamcalls on a hotplugged CPU
simply fails, thus it potentially affects any trusted domain in case its
vCPUs are scheduled to the plugged CPU. 

There is a puzzle whether we should just document such restriction or 
need more proactive measure (e.g. to prevent such case happen). Since 
it's similar to above situation where KVM fails to init on a hotplugged 
CPU, we'd like to seek your suggestion first.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Q. about KVM and CPU hotplug
  2021-11-30  8:27 Q. about KVM and CPU hotplug Tian, Kevin
@ 2021-11-30  9:28 ` Paolo Bonzini
  2021-11-30 14:05   ` Thomas Gleixner
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Paolo Bonzini @ 2021-11-30  9:28 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao

On 11/30/21 09:27, Tian, Kevin wrote:
> 		r = kvm_arch_hardware_enable();
> 
> 		if (r) {
> 			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> 			atomic_inc(&hardware_enable_failed);
> 			pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
> 		}
> 	}
> 
> Upon error hardware_enable_failed is incremented. However this variable
> is checked only in hardware_enable_all() called when the 1st VM is called.
> 
> This implies that KVM may be left in a state where it doesn't know a CPU
> not ready to host VMX operations.
> 
> Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> to Qemu at some point or may it lead to undefined behavior? And is there
> any method to prevent vCPU thread from being scheduled to the CPU?

It should fail the first vmptrld instruction.  It will result in a few 
WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this 
should be a pretty bad firmware bug, and it has never been reported. 
KVM did find some undocumented errata but not this one!

I don't think there's any fix other than pinning userspace.  The WARNs 
can be eliminated by calling KVM_BUG_ON in the sched_in notifier, plus 
checking if the VM is bugged before entering the guest or doing a 
VMREAD/VMWRITE (usually the check is done only in a ioctl).  But some 
refactoring is probably needed to make the code more robust in general.

Paolo

> By design the current generation of TDX doesn't support CPU hotplug. 
> Only boot-time CPUs can be initialized for TDX (and must be done en 
> masse in one breath). Attempting to do seamcalls on a hotplugged CPU
> simply fails, thus it potentially affects any trusted domain in case its
> vCPUs are scheduled to the plugged CPU. 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Q. about KVM and CPU hotplug
  2021-11-30  9:28 ` Paolo Bonzini
@ 2021-11-30 14:05   ` Thomas Gleixner
  2021-11-30 16:27     ` Paolo Bonzini
  2021-11-30 20:02   ` Sean Christopherson
  2021-12-01  6:59   ` Tian, Kevin
  2 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2021-11-30 14:05 UTC (permalink / raw)
  To: Paolo Bonzini, Tian, Kevin
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao, Peter Zijlstra

On Tue, Nov 30 2021 at 10:28, Paolo Bonzini wrote:

> On 11/30/21 09:27, Tian, Kevin wrote:
>> 		r = kvm_arch_hardware_enable();
>> 
>> 		if (r) {
>> 			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
>> 			atomic_inc(&hardware_enable_failed);
>> 			pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
>> 		}
>> 	}
>> 
>> Upon error hardware_enable_failed is incremented. However this variable
>> is checked only in hardware_enable_all() called when the 1st VM is called.
>> 
>> This implies that KVM may be left in a state where it doesn't know a CPU
>> not ready to host VMX operations.
>> 
>> Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
>> KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
>> to Qemu at some point or may it lead to undefined behavior? And is there
>> any method to prevent vCPU thread from being scheduled to the CPU?
>
> It should fail the first vmptrld instruction.  It will result in a few 
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this 
> should be a pretty bad firmware bug, and it has never been reported. 
> KVM did find some undocumented errata but not this one!
>
> I don't think there's any fix other than pinning userspace.  The WARNs 
> can be eliminated by calling KVM_BUG_ON in the sched_in notifier, plus 
> checking if the VM is bugged before entering the guest or doing a 
> VMREAD/VMWRITE (usually the check is done only in a ioctl).  But some 
> refactoring is probably needed to make the code more robust in general.

Why is this hotplug callback in the CPU starting section to begin with?

If you stick it into the online section which runs on the hotplugged CPU
in thread context:

	CPUHP_AP_ONLINE_IDLE,

-->   	CPUHP_AP_KVM_STARTING,

	CPUHP_AP_SCHED_WAIT_EMPTY,

then it is allowed to fail and it still works in the right way.

When onlining a CPU then there cannot be any vCPU task run on the
CPU at that point.

When offlining a CPU then it's guaranteed that all user tasks and
non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU
task around either.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Q. about KVM and CPU hotplug
  2021-11-30 14:05   ` Thomas Gleixner
@ 2021-11-30 16:27     ` Paolo Bonzini
  2021-12-01  7:18       ` Tian, Kevin
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2021-11-30 16:27 UTC (permalink / raw)
  To: Thomas Gleixner, Tian, Kevin
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao, Peter Zijlstra

On 11/30/21 15:05, Thomas Gleixner wrote:
> Why is this hotplug callback in the CPU starting section to begin with?

Just because the old notifier implementation used CPU_STARTING - in fact 
the commit messages say that CPU_STARTING was added partly *for* KVM 
(commit e545a6140b69, "kernel/cpu.c: create a CPU_STARTING cpu_chain 
notifier", 2008-09-08).

> If you stick it into the online section which runs on the hotplugged CPU
> in thread context:
> 
> 	CPUHP_AP_ONLINE_IDLE,
> 
> -->   	CPUHP_AP_KVM_STARTING,
> 
> 	CPUHP_AP_SCHED_WAIT_EMPTY,
> 
> then it is allowed to fail and it still works in the right way.

Yes, moving it to the online section should be fine; it wouldn't solve 
the TDX problem however.  Failure would rollback the hotplug and forbid 
hotplug altogether when TDX is loaded, which is not acceptable.

Paolo

> When onlining a CPU then there cannot be any vCPU task run on the
> CPU at that point.
> 
> When offlining a CPU then it's guaranteed that all user tasks and
> non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU
> task around either.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Q. about KVM and CPU hotplug
  2021-11-30  9:28 ` Paolo Bonzini
  2021-11-30 14:05   ` Thomas Gleixner
@ 2021-11-30 20:02   ` Sean Christopherson
  2021-12-01  6:59   ` Tian, Kevin
  2 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2021-11-30 20:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Tian, Kevin, Thomas Gleixner, linux-kernel, kvm, Yamahata, Isaku,
	Huang, Kai, Nakajima, Jun, Hansen, Dave, Gao, Chao

On Tue, Nov 30, 2021, Paolo Bonzini wrote:
> On 11/30/21 09:27, Tian, Kevin wrote:
> > 		r = kvm_arch_hardware_enable();
> > 
> > 		if (r) {
> > 			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> > 			atomic_inc(&hardware_enable_failed);
> > 			pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
> > 		}
> > 	}
> > 
> > Upon error hardware_enable_failed is incremented. However this variable
> > is checked only in hardware_enable_all() called when the 1st VM is called.
> > 
> > This implies that KVM may be left in a state where it doesn't know a CPU
> > not ready to host VMX operations.
> > 
> > Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> > KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> > to Qemu at some point or may it lead to undefined behavior? And is there
> > any method to prevent vCPU thread from being scheduled to the CPU?
> 
> It should fail the first vmptrld instruction.  It will result in a few
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this
> should be a pretty bad firmware bug, and it has never been reported. KVM did
> find some undocumented errata but not this one!

Heh, writing MSR_TEST_CTRL on some CPUs, e.g. Haswell, magically disables VMX.
Not exactly CPU hotplug, but we got close :-)  But yeah, if enabling VMX fails,
something in the CPU is badly mangled.

009bce1df0bb ("x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted")

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Q. about KVM and CPU hotplug
  2021-11-30  9:28 ` Paolo Bonzini
  2021-11-30 14:05   ` Thomas Gleixner
  2021-11-30 20:02   ` Sean Christopherson
@ 2021-12-01  6:59   ` Tian, Kevin
  2021-12-01 10:30     ` Thomas Gleixner
  2 siblings, 1 reply; 9+ messages in thread
From: Tian, Kevin @ 2021-12-01  6:59 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao

> From: Paolo Bonzini <paolo.bonzini@gmail.com>
> Sent: Tuesday, November 30, 2021 5:29 PM
> 
> On 11/30/21 09:27, Tian, Kevin wrote:
> > 		r = kvm_arch_hardware_enable();
> >
> > 		if (r) {
> > 			cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> > 			atomic_inc(&hardware_enable_failed);
> > 			pr_info("kvm: enabling virtualization on CPU%d
> failed\n", cpu);
> > 		}
> > 	}
> >
> > Upon error hardware_enable_failed is incremented. However this variable
> > is checked only in hardware_enable_all() called when the 1st VM is called.
> >
> > This implies that KVM may be left in a state where it doesn't know a CPU
> > not ready to host VMX operations.
> >
> > Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> > KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> > to Qemu at some point or may it lead to undefined behavior? And is there
> > any method to prevent vCPU thread from being scheduled to the CPU?
> 
> It should fail the first vmptrld instruction.  It will result in a few
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this
> should be a pretty bad firmware bug, and it has never been reported.
> KVM did find some undocumented errata but not this one!
> 

or it may be caused by incompatible CPU capabilities, which is currently
missing a check in kvm_starting_cpu(). So far the compatibility check is
done only once before registering cpu hotplug state machine:

        for_each_online_cpu(cpu) {
                smp_call_function_single(cpu, check_processor_compat, &c, 1);
                if (r < 0)
                        goto out_free_2;
        }

        r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
                                      kvm_starting_cpu, kvm_dying_cpu);

Thanks
Kevin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Q. about KVM and CPU hotplug
  2021-11-30 16:27     ` Paolo Bonzini
@ 2021-12-01  7:18       ` Tian, Kevin
  0 siblings, 0 replies; 9+ messages in thread
From: Tian, Kevin @ 2021-12-01  7:18 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao, Peter Zijlstra

> From: Paolo Bonzini <paolo.bonzini@gmail.com>
> Sent: Wednesday, December 1, 2021 12:27 AM
> 
> On 11/30/21 15:05, Thomas Gleixner wrote:
> > Why is this hotplug callback in the CPU starting section to begin with?
> 
> Just because the old notifier implementation used CPU_STARTING - in fact
> the commit messages say that CPU_STARTING was added partly *for* KVM
> (commit e545a6140b69, "kernel/cpu.c: create a CPU_STARTING cpu_chain
> notifier", 2008-09-08).
> 
> > If you stick it into the online section which runs on the hotplugged CPU
> > in thread context:
> >
> > 	CPUHP_AP_ONLINE_IDLE,
> >
> > -->   	CPUHP_AP_KVM_STARTING,
> >
> > 	CPUHP_AP_SCHED_WAIT_EMPTY,
> >
> > then it is allowed to fail and it still works in the right way.
> 
> Yes, moving it to the online section should be fine; it wouldn't solve
> the TDX problem however.  Failure would rollback the hotplug and forbid
> hotplug altogether when TDX is loaded, which is not acceptable.
> 

Fail hotplug just because TDX is loaded is not acceptable.

But fail hotplug when a trusted domain using TDX is active imo makes 
sense. It's similar philosophy to VMX which, with above change, will 
fail hotplug when kvm_usage_count is non-zero (implying a VM is 
active) but VMX initialization fails on this CPU. We can add similar
tdx_usage_count to mark active TDX users and forbid hotplug
when this variable is non-zero.

In general I think it's an acceptable policy to fail an operation if it 
breaks active existing usages... 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Q. about KVM and CPU hotplug
  2021-12-01  6:59   ` Tian, Kevin
@ 2021-12-01 10:30     ` Thomas Gleixner
  2021-12-04  3:57       ` Tian, Kevin
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2021-12-01 10:30 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao

On Wed, Dec 01 2021 at 06:59, Kevin Tian wrote:
>> From: Paolo Bonzini <paolo.bonzini@gmail.com>
>> It should fail the first vmptrld instruction.  It will result in a few
>> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX this
>> should be a pretty bad firmware bug, and it has never been reported.
>> KVM did find some undocumented errata but not this one!
>> 
>
> or it may be caused by incompatible CPU capabilities, which is currently
> missing a check in kvm_starting_cpu(). So far the compatibility check is
> done only once before registering cpu hotplug state machine:
>
>         for_each_online_cpu(cpu) {
>                 smp_call_function_single(cpu, check_processor_compat, &c, 1);
>                 if (r < 0)
>                         goto out_free_2;
>         }
>
>         r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
>                                       kvm_starting_cpu, kvm_dying_cpu);

Duh. This is silly _and_ broken.

Using for_each_inline_cpu() without holding cpus_read_lock() is racy
against concurrent hotplug. But even if the locking is added then
nothing prevents a CPU from being plugged _after_ the lock is dropped.

The right solution is to move the hotplug state into the threaded
section as I pointed out and do:

         r = cpuhp_setup_state(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
                               kvm_starting_cpu, kvm_dying_cpu);

which will do the right thing automatically. Checking for compatibility
would just be part of the kvm_starting_cpu() callback.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: Q. about KVM and CPU hotplug
  2021-12-01 10:30     ` Thomas Gleixner
@ 2021-12-04  3:57       ` Tian, Kevin
  0 siblings, 0 replies; 9+ messages in thread
From: Tian, Kevin @ 2021-12-04  3:57 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini
  Cc: linux-kernel, kvm, Yamahata, Isaku, Huang, Kai, Nakajima, Jun,
	Hansen, Dave, Gao, Chao

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, December 1, 2021 6:31 PM
> 
> On Wed, Dec 01 2021 at 06:59, Kevin Tian wrote:
> >> From: Paolo Bonzini <paolo.bonzini@gmail.com>
> >> It should fail the first vmptrld instruction.  It will result in a few
> >> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed).  For VMX
> this
> >> should be a pretty bad firmware bug, and it has never been reported.
> >> KVM did find some undocumented errata but not this one!
> >>
> >
> > or it may be caused by incompatible CPU capabilities, which is currently
> > missing a check in kvm_starting_cpu(). So far the compatibility check is
> > done only once before registering cpu hotplug state machine:
> >
> >         for_each_online_cpu(cpu) {
> >                 smp_call_function_single(cpu, check_processor_compat, &c, 1);
> >                 if (r < 0)
> >                         goto out_free_2;
> >         }
> >
> >         r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING,
> "kvm/cpu:starting",
> >                                       kvm_starting_cpu, kvm_dying_cpu);
> 
> Duh. This is silly _and_ broken.
> 
> Using for_each_inline_cpu() without holding cpus_read_lock() is racy
> against concurrent hotplug. But even if the locking is added then
> nothing prevents a CPU from being plugged _after_ the lock is dropped.
> 
> The right solution is to move the hotplug state into the threaded
> section as I pointed out and do:
> 
>          r = cpuhp_setup_state(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
>                                kvm_starting_cpu, kvm_dying_cpu);
> 
> which will do the right thing automatically. Checking for compatibility
> would just be part of the kvm_starting_cpu() callback.
> 

Yes, this sounds the right thing to do. We'll work on a fix.

And as said in another reply to Paolo, future TDX compatibility check
will also be added to this callback.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-12-04  3:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-30  8:27 Q. about KVM and CPU hotplug Tian, Kevin
2021-11-30  9:28 ` Paolo Bonzini
2021-11-30 14:05   ` Thomas Gleixner
2021-11-30 16:27     ` Paolo Bonzini
2021-12-01  7:18       ` Tian, Kevin
2021-11-30 20:02   ` Sean Christopherson
2021-12-01  6:59   ` Tian, Kevin
2021-12-01 10:30     ` Thomas Gleixner
2021-12-04  3:57       ` Tian, Kevin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.