On Thu, 2018-01-04 at 13:05 +0000, Wei Liu wrote:
> From: Roger Pau Monne
> 
> Avoid scheduling vCPUs that are blocked, there's no point in assigning
> them to a pCPU because they are not going to run anyway.
> 
> Since blocked vCPUs are not assigned to pCPUs after this change, force
> a rescheduling when a vCPU is brought up if it's on the waitqueue.
> Also when scheduling try to pick a vCPU from the runqueue if the pCPU
> is running idle.
> 
> Signed-off-by: Roger Pau Monné
> ---
> Cc: George Dunlap
> Cc: Dario Faggioli
> ---
> Changes since v1:
>  - Force a rescheduling when a vCPU is brought up.
>  - Try to pick a vCPU from the runqueue if running the idle vCPU.
> 
As noted by Jan already, there's a mixing of "blocked" and "down" (or
offline) here.

In the null scheduler, a vCPU that is assigned to a pCPU is free to
block and wake up as many times as it wants (quite obviously). And when
it blocks, the pCPU will just stay idle. There's no such thing as
pulling another vCPU onto that pCPU, either from the waitqueue or from
anywhere else. That's the whole point of this scheduler, actually.

Now, I'm not quite sure whether or not this can be a problem in the
"shim scenario". If it is, we have to think of a solution that does not
totally defeat the purpose of the scheduler when used on baremetal. Or
use another scheduler, perhaps configuring static 1:1 pinning.

Null seems a great fit for this use case to me, so, I'd say, let's try
to find a nice and cool way to use it. :-)

> ---
>  xen/common/sched_null.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/common/sched_null.c b/xen/common/sched_null.c
> index b4a24baf8e..bacfb31cb3 100644
> --- a/xen/common/sched_null.c
> +++ b/xen/common/sched_null.c
> @@ -574,6 +574,8 @@ static void null_vcpu_wake(const struct scheduler *ops, struct vcpu *v)
>      {
>          /* Not exactly "on runq", but close enough for reusing the counter */
>          SCHED_STAT_CRANK(vcpu_wake_onrunq);
> +        /* Force a rescheduling in case some CPU is idle can pick this vCPU */
> +        cpumask_raise_softirq(&cpu_online_map, SCHEDULE_SOFTIRQ);
>          return;
> 
This needs to become 'the CPUs of vcpu->domain's cpupool'.

I appreciate that this is fine when running as shim, where you
certainly don't use cpupools. But when this runs on baremetal, if we
use cpu_online_map, basically _all_ the online CPUs --even the ones
that are in another pool, under a different scheduler-- will be forced
to reschedule. And we don't want that.

I'm also not 100% convinced that this must/can live here. Basically,
you're saying that, when vcpu_wake() is called on a vCPU that happens
to be in the waitqueue, we should reschedule. And, AFAIUI, this is to
cover the case where a vCPU of the L2 guest comes online.

Well, it may even be technically fine. Still, if what we want to deal
with is vCPU onlining, I would prefer to at least try to find a place
which is more related to the onlining path than to the wakeup path. If
you confirm your intent, I can have a look at the code and try to
identify such a better place...
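To make the cpumask point a bit more concrete, something like the below
is what I have in mind (just a sketch, untested and written from memory,
so don't trust the surrounding context too much; I'm assuming
cpupool_domain_cpumask() is fine to use from null_vcpu_wake(), as IIRC
we already use it elsewhere in sched_null.c):

  if ( unlikely(!list_empty(&null_vcpu(v)->waitq_elem)) )
  {
      /* Not exactly "on runq", but close enough for reusing the counter */
      SCHED_STAT_CRANK(vcpu_wake_onrunq);
      /*
       * Kick only the pCPUs of v's own cpupool, so that an idle one
       * among them can pick v up, instead of kicking every online pCPU
       * in the system.
       */
      cpumask_raise_softirq(cpupool_domain_cpumask(v->domain),
                            SCHEDULE_SOFTIRQ);
      return;
  }

In the shim case this degenerates to (pretty much) cpu_online_map
anyway, since you don't use cpupools there, while on baremetal it
leaves the CPUs of other pools alone.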
> @@ -761,9 +763,10 @@ static struct task_slice null_schedule(const struct scheduler *ops,
>      /*
>       * We may be new in the cpupool, or just coming back online. In which
>       * case, there may be vCPUs in the waitqueue that we can assign to us
> -     * and run.
> +     * and run. Also check whether this CPU is running idle, in which case try
> +     * to pick a vCPU from the waitqueue.
>       */
> -    if ( unlikely(ret.task == NULL) )
> +    if ( unlikely(ret.task == NULL || ret.task == idle_vcpu[cpu]) )
> 
I don't think I understand this. I may be a bit rusty, but are you sure
that, on an idle pCPU, ret.task is idle_vcpu at this point in this
function? I don't think it is.

Also, I'm quite sure this may mess things up for tasklets. In fact, the
one case where ret.task is idle_vcpu here is when we have just forced
it to be so, in order to run a tasklet. But with this, we scan the
waitqueue instead, and may end up running something else.

> @@ -781,6 +784,10 @@ static struct task_slice null_schedule(const struct scheduler *ops,
>          {
>              list_for_each_entry( wvc, &prv->waitq, waitq_elem )
>              {
> +                if ( test_bit(_VPF_down, &wvc->vcpu->pause_flags) )
> +                    /* Skip vCPUs that are down. */
> +                    continue;
> +
> 
So, yes, I think things like this are what we want. As said above for
the wakeup case, though, I'd prefer to find a way to avoid offline
vCPUs ending up in the waitqueue in the first place, rather than having
to skip them there.

Side note: is_vcpu_online() can be used for the test.

Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/