* [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
@ 2021-10-04 22:26 Nitesh Narayan Lal
2021-10-05 9:38 ` Paolo Bonzini
0 siblings, 1 reply; 7+ messages in thread
From: Nitesh Narayan Lal @ 2021-10-04 22:26 UTC (permalink / raw)
To: linux-kernel, kvm, pbonzini, seanjc, vkuznets, mtosatti, tglx,
frederic, mingo, nilal
From: Marcelo Tosatti <mtosatti@redhat.com>
kvm_vm_worker_thread() creates a kthread VM worker and migrates it
to the parent cgroup using cgroup_attach_task_all() based on its
effective cpumask.
In an environment booted with the nohz_full kernel option, the cgroup's
effective cpumask can also include CPUs running in nohz_full mode. These
CPUs often run SCHED_FIFO tasks, which may result in starvation of the
VM worker if it has been migrated to one of these CPUs.
Since the allowed CPU mask of unbound kernel threads already respects
nohz_full CPUs at setup time (because of commit 9cc5b8656892, "isolcpus:
Affine unbound kernel threads to housekeeping cpus"), retain the initial
CPU mask for the kthread by not migrating it to the parent cgroup's
effective CPUs.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
virt/kvm/kvm_main.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7851f3a1b5f7..87bc193fd020 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -56,6 +56,7 @@
#include <asm/processor.h>
#include <asm/ioctl.h>
#include <linux/uaccess.h>
+#include <linux/sched/isolation.h>
#include "coalesced_mmio.h"
#include "async_pf.h"
@@ -5634,11 +5635,20 @@ static int kvm_vm_worker_thread(void *context)
if (err)
goto init_complete;
- err = cgroup_attach_task_all(init_context->parent, current);
- if (err) {
- kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
- __func__, err);
- goto init_complete;
+ /*
+ * For nohz_full enabled environments, don't migrate the worker thread
+ * to the parent cgroup, as its effective mask may include a CPU running
+ * in nohz_full mode. nohz_full CPUs often run SCHED_FIFO tasks, which
+ * could result in starvation of the worker thread if it is pinned on
+ * the same CPU.
+ */
+ if (!housekeeping_enabled(HK_FLAG_KTHREAD)) {
+ err = cgroup_attach_task_all(init_context->parent, current);
+ if (err) {
+ kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
+ __func__, err);
+ goto init_complete;
+ }
}
set_user_nice(current, task_nice(init_context->parent));
--
2.27.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
2021-10-04 22:26 [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker Nitesh Narayan Lal
@ 2021-10-05 9:38 ` Paolo Bonzini
2021-10-05 10:58 ` Marcelo Tosatti
From: Paolo Bonzini @ 2021-10-05 9:38 UTC (permalink / raw)
To: Nitesh Narayan Lal, linux-kernel, kvm, seanjc, vkuznets,
mtosatti, tglx, frederic, mingo, nilal, Wanpeng Li
[+Wanpeng]
On 05/10/21 00:26, Nitesh Narayan Lal wrote:
> From: Marcelo Tosatti <mtosatti@redhat.com>
>
> kvm_vm_worker_thread() creates a kthread VM worker and migrates it
> to the parent cgroup using cgroup_attach_task_all() based on its
> effective cpumask.
>
> In an environment that is booted with the nohz_full kernel option, cgroup's
> effective cpumask can also include CPUs running in nohz_full mode. These
> CPUs often run SCHED_FIFO tasks which may result in the starvation of the
> VM worker if it has been migrated to one of these CPUs.
Cgroups have other effects (e.g. memory accounting) beyond just
the cpumask; for cgroup v1 you could just skip the cpuset, but if
cgroup_attach_task_all is ever ported to v2's cgroup_attach_task, we
will not be able to separate the cpuset cgroup from the others.
Why doesn't the scheduler move the task to a CPU that is not being
hogged by vCPU SCHED_FIFO tasks? The parent cgroup should always have
one such CPU for userspace's own housekeeping.
As an aside, if we decide that KVM's worker threads count as
housekeeping, you'd still want to bind the kthread to the housekeeping
CPUs(*).
Paolo
(*) switching from kthread_run to kthread_create+kthread_bind_mask
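As a rough sketch of that aside (illustrative only; the helper name and
calling convention are assumptions, not the actual KVM code), the worker
could be created unbound and pinned to the housekeeping CPUs before it
is woken:

```c
/*
 * Hedged sketch of Paolo's aside: create the VM worker with
 * kthread_create() and pin it to the housekeeping CPUs with
 * kthread_bind_mask() before waking it, instead of kthread_run()
 * followed by cgroup_attach_task_all(). The helper name is
 * hypothetical.
 */
#include <linux/kthread.h>
#include <linux/sched/isolation.h>

static struct task_struct *
kvm_start_worker_on_housekeeping(int (*fn)(void *), void *data,
				 const char *name)
{
	struct task_struct *t;

	t = kthread_create(fn, data, "%s", name);
	if (IS_ERR(t))
		return t;

	/* Restrict the worker to housekeeping (non-isolated) CPUs. */
	kthread_bind_mask(t, housekeeping_cpumask(HK_FLAG_KTHREAD));
	wake_up_process(t);
	return t;
}
```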
> Since unbounded kernel threads allowed CPU mask already respects nohz_full
> CPUs at the time of their setup (because of 9cc5b8656892: "isolcpus: Affine
> unbound kernel threads to housekeeping cpus"), retain the initial CPU mask
> for the kthread by stopping its migration to the parent cgroup's effective
> CPUs.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
> virt/kvm/kvm_main.c | 20 +++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7851f3a1b5f7..87bc193fd020 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -56,6 +56,7 @@
> #include <asm/processor.h>
> #include <asm/ioctl.h>
> #include <linux/uaccess.h>
> +#include <linux/sched/isolation.h>
>
> #include "coalesced_mmio.h"
> #include "async_pf.h"
> @@ -5634,11 +5635,20 @@ static int kvm_vm_worker_thread(void *context)
> if (err)
> goto init_complete;
>
> - err = cgroup_attach_task_all(init_context->parent, current);
> - if (err) {
> - kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
> - __func__, err);
> - goto init_complete;
> + /*
> + * For nohz_full enabled environments, don't migrate the worker thread
> + * to parent cgroup as its effective mask may have a CPU running in
> + * nohz_full mode. nohz_full CPUs often run SCHED_FIFO task which could
> + * result in starvation of the worker thread if it is pinned on the same
> + * CPU.
> + */
> + if (!housekeeping_enabled(HK_FLAG_KTHREAD)) {
> + err = cgroup_attach_task_all(init_context->parent, current);
> + if (err) {
> + kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
> + __func__, err);
> + goto init_complete;
> + }
> }
>
> set_user_nice(current, task_nice(init_context->parent));
>
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
2021-10-05 9:38 ` Paolo Bonzini
@ 2021-10-05 10:58 ` Marcelo Tosatti
[not found] ` <96f38a69-2ff8-a78c-a417-d32f1eb742be@redhat.com>
From: Marcelo Tosatti @ 2021-10-05 10:58 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Nitesh Narayan Lal, linux-kernel, kvm, seanjc, vkuznets, tglx,
frederic, mingo, nilal, Wanpeng Li
Hi Paolo,
On Tue, Oct 05, 2021 at 11:38:29AM +0200, Paolo Bonzini wrote:
> [+Wanpeng]
>
> On 05/10/21 00:26, Nitesh Narayan Lal wrote:
> > From: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > kvm_vm_worker_thread() creates a kthread VM worker and migrates it
> > to the parent cgroup using cgroup_attach_task_all() based on its
> > effective cpumask.
> >
> > In an environment that is booted with the nohz_full kernel option, cgroup's
> > effective cpumask can also include CPUs running in nohz_full mode. These
> > CPUs often run SCHED_FIFO tasks which may result in the starvation of the
> > VM worker if it has been migrated to one of these CPUs.
>
> There are other effects of cgroups (e.g. memory accounting) than just the
> cpumask; for v1 you could just skip the cpuset, but if
> cgroup_attach_task_all is ever ported to v2's cgroup_attach_task, we will
> not be able to separate the cpuset cgroup from the others.
cgroup_attach_task_all does use cgroup_attach_task on linux-2.6.git...
It would be good to have this working on both cgroup-v1 and cgroup-v2.
Is kvm-nx-hpage using significant amounts of memory?
> Why doesn't the scheduler move the task to a CPU that is not being hogged by
> vCPU SCHED_FIFO tasks?
Because cpuset placement is enforced:
CPUSET(7)                 Linux Programmer's Manual                 CPUSET(7)
Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
mechanism and the mbind(2) and set_mempolicy(2) memory-placement mechanisms
in the kernel. Neither of these mechanisms let a process make use of a CPU
or memory node that is not allowed by that process's cpuset. If changes to
a process's cpuset placement conflict with these other mechanisms, then
cpuset placement is enforced even if it means overriding these other
mechanisms. The kernel accomplishes this overriding by silently restricting
the CPUs and memory nodes requested by these other mechanisms to those
allowed by the invoking process's cpuset. This can result in these other
calls returning an error, if, for example, such a call ends up requesting
an empty set of CPUs or memory nodes, after that request is restricted to
the invoking process's cpuset.
> The parent cgroup should always have one for
> userspace's own housekeeping.
>
> As an aside, if we decide that KVM's worker threads count as housekeeping,
> you'd still want to bind the kthread to the housekeeping CPUs(*).
This is being done automatically by HK_FLAG_KTHREAD (see
kernel/kthread.c).
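For reference, the automatic affinity being referred to is applied at
kthread creation time; a condensed sketch of that logic (heavily elided
and paraphrased from kernel/kthread.c after commit 9cc5b8656892, not
verbatim upstream code):

```c
/*
 * Condensed sketch of __kthread_create_on_node() after commit
 * 9cc5b8656892 ("isolcpus: Affine unbound kernel threads to
 * housekeeping cpus"). Most of the function is elided; only the
 * affinity step relevant to this discussion is shown.
 */
struct task_struct *__kthread_create_on_node(/* ... */)
{
	struct task_struct *task;

	/* ... creation, naming and wakeup plumbing elided ... */

	/*
	 * New unbound kthreads are affined to the housekeeping CPUs,
	 * so by default they never land on an isolated CPU.
	 */
	set_cpus_allowed_ptr(task, housekeeping_cpumask(HK_FLAG_KTHREAD));
	return task;
}
```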
>
> Paolo
>
> (*) switching from kthread_run to kthread_create+kthread_bind_mask
>
> > Since unbounded kernel threads allowed CPU mask already respects nohz_full
> > CPUs at the time of their setup (because of 9cc5b8656892: "isolcpus: Affine
> > unbound kernel threads to housekeeping cpus"), retain the initial CPU mask
> > for the kthread by stopping its migration to the parent cgroup's effective
> > CPUs.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > ---
> > virt/kvm/kvm_main.c | 20 +++++++++++++++-----
> > 1 file changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 7851f3a1b5f7..87bc193fd020 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -56,6 +56,7 @@
> > #include <asm/processor.h>
> > #include <asm/ioctl.h>
> > #include <linux/uaccess.h>
> > +#include <linux/sched/isolation.h>
> > #include "coalesced_mmio.h"
> > #include "async_pf.h"
> > @@ -5634,11 +5635,20 @@ static int kvm_vm_worker_thread(void *context)
> > if (err)
> > goto init_complete;
> > - err = cgroup_attach_task_all(init_context->parent, current);
> > - if (err) {
> > - kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
> > - __func__, err);
> > - goto init_complete;
> > + /*
> > + * For nohz_full enabled environments, don't migrate the worker thread
> > + * to parent cgroup as its effective mask may have a CPU running in
> > + * nohz_full mode. nohz_full CPUs often run SCHED_FIFO task which could
> > + * result in starvation of the worker thread if it is pinned on the same
> > + * CPU.
> > + */
Actually, we don't want the kthread on the isolated CPUs (irrespective
of nohz_full=, starvation, or anything). It's just about
"don't run a kernel thread on isolated CPUs".
> > + if (!housekeeping_enabled(HK_FLAG_KTHREAD)) {
> > + err = cgroup_attach_task_all(init_context->parent, current);
> > + if (err) {
> > + kvm_err("%s: cgroup_attach_task_all failed with err %d\n",
> > + __func__, err);
> > + goto init_complete;
> > + }
> > }
> > set_user_nice(current, task_nice(init_context->parent));
> >
>
>
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
[not found] ` <96f38a69-2ff8-a78c-a417-d32f1eb742be@redhat.com>
@ 2021-10-05 13:21 ` Marcelo Tosatti
2021-10-05 13:49 ` Nitesh Lal
2021-10-05 17:15 ` Paolo Bonzini
From: Marcelo Tosatti @ 2021-10-05 13:21 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Nitesh Narayan Lal, linux-kernel, kvm, seanjc, vkuznets, tglx,
frederic, mingo, nilal, Wanpeng Li
On Tue, Oct 05, 2021 at 01:25:52PM +0200, Paolo Bonzini wrote:
> On 05/10/21 12:58, Marcelo Tosatti wrote:
> > > There are other effects of cgroups (e.g. memory accounting) than just the
> > > cpumask;
> >
> > Is kvm-nx-hpage using significant amounts of memory?
>
> No, that was just an example (and not a good one indeed, because
> kvm-nx-hpage is not using a substantial amount of either memory or CPU).
> But for example vhost also uses cgroup_attach_task_all, so it should have
> the same issue with SCHED_FIFO?
Yes. Would need to fix vhost as well.
>
> > > Why doesn't the scheduler move the task to a CPU that is not being hogged by
> > > vCPU SCHED_FIFO tasks?
> > Because cpuset placement is enforced:
>
> Yes, but I would expect the parent cgroup to include both isolated CPUs (for
> the vCPU threads) and non-isolated housekeeping vCPUs (for the QEMU I/O
> thread).
Yes, the parent, but why would that matter? If you are in a child
cpuset, you are restricted to the child cpuset's mask (not the
parent's).
> The QEMU I/O thread is not hogging the CPU 100% of the time, and
> therefore the nx-recovery thread should be able to run on that CPU.
Yes, but:
1) The cpumask of the parent thread is not inherited; the
set_cpus_allowed_ptr(task, housekeeping_cpumask(HK_FLAG_KTHREAD));
in __kthread_create_on_node() should fail (because its cgroup, the one
inherited from QEMU, contains only isolated CPUs).
(The QEMU I/O thread runs on an isolated CPU, and is moved by libvirt
to the housekeeping cgroup as mentioned before).
2) What if kernel threads that should be pinned to non-isolated CPUs are
created from vCPUs?
>
> Thanks,
>
> Paolo
>
> > CPUSET(7)              Linux Programmer's Manual              CPUSET(7)
> >
> > Cpusets are integrated with the sched_setaffinity(2) scheduling
> > affinity mechanism and the mbind(2) and set_mempolicy(2)
> > memory-placement mechanisms in the kernel. Neither of these mechanisms
> > let a process make use of a CPU or memory node that is not allowed by
> > that process's cpuset. If changes to a process's cpuset placement
> > conflict with these other mechanisms, then cpuset placement is
> > enforced even if it means overriding these other mechanisms. The
> > kernel accomplishes this overriding by silently restricting the CPUs
> > and memory nodes requested by these other mechanisms to those allowed
> > by the invoking process's cpuset. This can result in these other calls
> > returning an error, if, for example, such a call ends up requesting an
> > empty set of CPUs or memory nodes, after that request is restricted to
> > the invoking process's cpuset.
> >
> >
>
>
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
2021-10-05 13:21 ` Marcelo Tosatti
@ 2021-10-05 13:49 ` Nitesh Lal
2021-10-05 17:15 ` Paolo Bonzini
From: Nitesh Lal @ 2021-10-05 13:49 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Paolo Bonzini, Nitesh Narayan Lal, linux-kernel, kvm,
Sean Christopherson, Vitaly Kuznetsov, Thomas Gleixner,
Frederic Weisbecker, Ingo Molnar, Wanpeng Li
On Tue, Oct 5, 2021 at 9:22 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Tue, Oct 05, 2021 at 01:25:52PM +0200, Paolo Bonzini wrote:
> > On 05/10/21 12:58, Marcelo Tosatti wrote:
> > > > There are other effects of cgroups (e.g. memory accounting) than just the
> > > > cpumask;
> > >
> > > Is kvm-nx-hpage using significant amounts of memory?
> >
> > No, that was just an example (and not a good one indeed, because
> > kvm-nx-hpage is not using a substantial amount of either memory or CPU).
> > But for example vhost also uses cgroup_attach_task_all, so it should have
> > the same issue with SCHED_FIFO?
>
> Yes. Would need to fix vhost as well.
>
> >
> > > > Why doesn't the scheduler move the task to a CPU that is not being hogged by
> > > > vCPU SCHED_FIFO tasks?
> > > Because cpuset placement is enforced:
> >
> > Yes, but I would expect the parent cgroup to include both isolated CPUs (for
> > the vCPU threads) and non-isolated housekeeping vCPUs (for the QEMU I/O
> > thread).
>
> Yes, the parent, but why would that matter? If you are in a child
> cpuset, you are restricted to the child cpuset mask (and not the
> parents).
Yes, and at the time of cpuset_attach, the task is attached to any one of
the CPUs that are in the effective cpumask.
>
> > The QEMU I/O thread is not hogging the CPU 100% of the time, and
> > therefore the nx-recovery thread should be able to run on that CPU.
>
> Yes, but:
>
> 1) The cpumask of the parent thread is not inherited
>
> set_cpus_allowed_ptr(task, housekeeping_cpumask(HK_FLAG_KTHREAD));
>
> On __kthread_create_on_node should fail (because its cgroup, the one
> inherited from QEMU, contains only isolated CPUs).
>
Just to confirm, do you mean fail only for unbound kthreads?
--
Thanks
Nitesh
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
2021-10-05 13:21 ` Marcelo Tosatti
2021-10-05 13:49 ` Nitesh Lal
@ 2021-10-05 17:15 ` Paolo Bonzini
2021-10-05 17:57 ` Marcelo Tosatti
From: Paolo Bonzini @ 2021-10-05 17:15 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: Nitesh Narayan Lal, linux-kernel, kvm, seanjc, vkuznets, tglx,
frederic, mingo, nilal, Wanpeng Li
On 05/10/21 15:21, Marcelo Tosatti wrote:
>> The QEMU I/O thread is not hogging the CPU 100% of the time, and
>> therefore the nx-recovery thread should be able to run on that CPU.
>
> 1) The cpumask of the parent thread is not inherited
>
> set_cpus_allowed_ptr(task, housekeeping_cpumask(HK_FLAG_KTHREAD));
>
> On __kthread_create_on_node should fail (because its cgroup, the one
> inherited from QEMU, contains only isolated CPUs).
>
> (The QEMU I/O thread runs on an isolated CPU, and is moved by libvirt
> to HK-cgroup as mentioned before).
Ok, that's the part that I missed. So the core issue is that libvirt
moves the I/O thread out of the isolated-CPU cgroup too late? This in
turn is because libvirt gets access to the QEMU monitor too late (the
KVM file descriptor is created when QEMU processes the command line).
Paolo
* Re: [PATCH v1] KVM: isolation: retain initial mask for kthread VM worker
2021-10-05 17:15 ` Paolo Bonzini
@ 2021-10-05 17:57 ` Marcelo Tosatti
From: Marcelo Tosatti @ 2021-10-05 17:57 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Nitesh Narayan Lal, linux-kernel, kvm, seanjc, vkuznets, tglx,
frederic, mingo, nilal, Wanpeng Li
On Tue, Oct 05, 2021 at 07:15:43PM +0200, Paolo Bonzini wrote:
> On 05/10/21 15:21, Marcelo Tosatti wrote:
> > > The QEMU I/O thread is not hogging the CPU 100% of the time, and
> > > therefore the nx-recovery thread should be able to run on that CPU.
> >
> > 1) The cpumask of the parent thread is not inherited
> >
> > set_cpus_allowed_ptr(task, housekeeping_cpumask(HK_FLAG_KTHREAD));
> >
> > On __kthread_create_on_node should fail (because its cgroup, the one
> > inherited from QEMU, contains only isolated CPUs).
> >
> > (The QEMU I/O thread runs on an isolated CPU, and is moved by libvirt
> > to HK-cgroup as mentioned before).
>
> Ok, that's the part that I missed. So the core issue is that libvirt moves
> the I/O thread out of the isolated-CPU cgroup too late? This in turn is
> because libvirt gets access to the QEMU monitor too late (the KVM file
> descriptor is created when QEMU processes the command line).
Actually, what I wrote was incorrect: set_cpus_allowed_ptr() should
succeed (at kthread creation), but cgroup_attach_task_all() will then
override the kthread's mask with the cgroup mask
(which contains isolated CPUs).
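The corrected sequence can be summarized as follows (illustrative
pseudocode in C comments; the call sites follow the thread's
discussion, not exact upstream code):

```c
/*
 * Illustrative sequence of how the worker ends up on isolated CPUs.
 * Step numbering and context names are for explanation only.
 */

/* 1) kthread creation succeeds; the new thread's mask starts as the
 *    housekeeping CPUs (set in __kthread_create_on_node()).
 */
task = kthread_run(kvm_vm_worker_thread, init_context, "...");

/* 2) Inside the worker, attaching to QEMU's cgroup rewrites the mask
 *    to the cgroup's effective cpumask, which at this point (before
 *    libvirt moves the I/O thread out) contains only isolated CPUs,
 *    undoing the housekeeping affinity from step 1.
 */
cgroup_attach_task_all(init_context->parent, current);
```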
Paolo: about your suggestion to use the same CPU for the nx-recovery
thread as for the I/O thread: I believe the cpumask of the userspace
parent (the QEMU I/O thread) is not inherited by the kernel thread.
We will resend the patchset once more data to justify this is available
(or once it is clear that a userspace solution is not possible).