From: George Dunlap <George.Dunlap@eu.citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Andre Przywara <andre.przywara@amd.com>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Anil Madhavapeddy <anil@recoil.org>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	xen-devel@lists.xen.org, Jan Beulich <JBeulich@suse.com>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>
Subject: Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
Date: Tue, 9 Oct 2012 17:29:54 +0100
Message-ID: <CAFLBxZb-366Bzkd+Ar3z5p+EfrX6VpesVWKv-PnGhhD7HWLS7g@mail.gmail.com>
In-Reply-To: <ca2fa958879bbffa9bc6.1349446101@Solace>

On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> As vcpu-affinity tells where vcpus can run, node-affinity tells
> where a domain's vcpus prefer to run. Respecting vcpu-affinity is
> the primary concern, but honouring node-affinity will likely
> result in some performance benefit.
>
> This change modifies the vcpu load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> to give precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails to find a valid CPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Looking at the load-balancing code makes me think that there is
probably some interesting work to do there in the future; but I think
this patch can go in as it is for now.  So

(Once others' comments are addressed)
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Do scroll down to read my comments on load balancing...
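
A note before the hunk: it uses for_each_csched_balance_step() and
csched_balance_cpumask(), which this patch introduces elsewhere and
which are not visible in the quoted context. The sketch below is my
reading of the intended semantics, not the series' actual code; the
macro shape, the struct field names and the helper body are all
assumptions:

    /* Sketch only: names and details here are illustrative assumptions. */
    #define CSCHED_BALANCE_NODE_AFFINITY 0
    #define CSCHED_BALANCE_CPU_AFFINITY  1
    #define for_each_csched_balance_step(step)                 \
        for ( (step) = CSCHED_BALANCE_NODE_AFFINITY;           \
              (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )

    /*
     * Fill *mask with the cpus vc may use at this balancing step.
     * Return nonzero if the step is pointless for vc (e.g. its
     * node-affinity adds no constraint beyond its vcpu-affinity),
     * so that the caller can just skip it.
     */
    static int csched_balance_cpumask(const struct vcpu *vc, int step,
                                      cpumask_t *mask)
    {
        if ( step == CSCHED_BALANCE_NODE_AFFINITY )
        {
            /* node_affinity_cpumask is a hypothetical field name. */
            cpumask_and(mask, vc->domain->node_affinity_cpumask,
                        vc->cpu_affinity);
            return cpumask_empty(mask) ||
                   cpumask_equal(mask, vc->cpu_affinity);
        }
        cpumask_copy(mask, vc->cpu_affinity);
        return 0;
    }

With something like that in mind, the two nested loops in the hunk read
naturally: first try to steal a node-affine vcpu, then fall back to
anything the vcpu-affinity allows.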

> @@ -1211,30 +1289,48 @@ csched_runq_steal(int peer_cpu, int cpu,
>       */
>      if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
>      {
> -        list_for_each( iter, &peer_pcpu->runq )
> +        int balance_step;
> +
> +        /*
> +         * Take node-affinity into account. That means, for all the vcpus
> +         * in peer_pcpu's runq, check _first_ if their node-affinity allows
> +         * them to run on cpu. If not, retry the loop considering plain
> +         * vcpu-affinity. Also, notice that as soon as one vcpu is found,
> +         * balancing is considered done, and the vcpu is returned to the
> +         * caller.
> +         */
> +        for_each_csched_balance_step(balance_step)
>          {
> -            speer = __runq_elem(iter);
> +            list_for_each( iter, &peer_pcpu->runq )
> +            {
> +                cpumask_t balance_mask;
>
> -            /*
> -             * If next available VCPU here is not of strictly higher
> -             * priority than ours, this PCPU is useless to us.
> -             */
> -            if ( speer->pri <= pri )
> -                break;
> +                speer = __runq_elem(iter);
>
> -            /* Is this VCPU is runnable on our PCPU? */
> -            vc = speer->vcpu;
> -            BUG_ON( is_idle_vcpu(vc) );
> +                /*
> +                 * If next available VCPU here is not of strictly higher
> +                 * priority than ours, this PCPU is useless to us.
> +                 */
> +                if ( speer->pri <= pri )
> +                    break;
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> -            {
> -                /* We got a candidate. Grab it! */
> -                CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
> -                CSCHED_STAT_CRANK(migrate_queued);
> -                WARN_ON(vc->is_urgent);
> -                __runq_remove(speer);
> -                vc->processor = cpu;
> -                return speer;
> +                /* Is this VCPU runnable on our PCPU? */
> +                vc = speer->vcpu;
> +                BUG_ON( is_idle_vcpu(vc) );
> +
> +                if ( csched_balance_cpumask(vc, balance_step, &balance_mask) )
> +                    continue;

This will have the effect that a vcpu with any node affinity at all
will be stolen before a vcpu with no node affinity: i.e., if you have
a system with 4 nodes, and one vcpu has affinity to nodes 1-2-3,
another has affinity to node 1 only, and another has affinity to all
4, the one with affinity to all 4 will be passed over in the first
round, while either of the first two might be nabbed (depending on
which pcpu they're on).

Furthermore, the effect of this whole thing (if I'm reading it right)
will be to go through *each runqueue* twice, rather than checking all
cpus for vcpus with good node affinity, and then all cpus for vcpus
with good cpumasks.

It seems like it would be better to check:
* This node for node-affine work to steal
* Other nodes for node-affine work to steal
* All nodes for cpu-affine work to steal.

Ideally, the search would terminate fairly quickly with the first set,
meaning that in the common case we never even check other nodes.
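
In rough pseudocode (steal(), cpus_of() and all_cpus_but() are
hypothetical helpers, made up here just to illustrate the ordering):

    /* 1) node-affine vcpus on the pcpus of our own node */
    if ( (speer = steal(cpus_of(our_node), NODE_AFFINITY)) != NULL )
        return speer;
    /* 2) node-affine vcpus on the pcpus of the other nodes */
    if ( (speer = steal(all_cpus_but(our_node), NODE_AFFINITY)) != NULL )
        return speer;
    /* 3) finally, anything our plain vcpu-affinity lets us run */
    return steal(all_cpus, CPU_AFFINITY);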

Going through the cpu list twice means trying to grab the scheduler
lock for each cpu twice; but hopefully that would be made up for by
having a shorter list.

Thoughts?

Like I said, I think this is something to put on our to-do list; this
patch should go in so we can start testing it as soon as possible.

 -George

> +
> +                if (__csched_vcpu_is_migrateable(vc, cpu, &balance_mask))
> +                {
> +                    /* We got a candidate. Grab it! */
> +                    CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
> +                    CSCHED_STAT_CRANK(migrate_queued);
> +                    WARN_ON(vc->is_urgent);
> +                    __runq_remove(speer);
> +                    vc->processor = cpu;
> +                    return speer;
> +                }
>              }
>          }
>      }

