On Mon, 2013-12-09 at 22:08 -1000, Justin Weaver wrote:
> Hello,
>
Hey... here I am! It took a while, eh? Well, sorry for that. :-(

First of all, a bit of context for the people in Cc (I added Marcus)
and the list. Basically, Justin is looking into implementing soft
affinity for credit2. As we know, credit2 lacks hard affinity too;
we'll see along the way whether it would be easy enough to add that
as well.

> On Sat, Nov 30, 2013 at 10:18 PM, Dario Faggioli wrote:
>     I'll have to re-look at the details of credit2 about load
>     balancing and migration between CPUs/runqueues, but it looks
>     like we need something allowing us to honour pinning/affinity
>     _within_ the same runqueue anyway, don't we? I mean, even if
>     you implement per-L2 runqueues, each of those would still span
>     more than one CPU, and the user may well want to pin a vCPU to
>     only one (or, in general, a subset) of them.
>
> Yes, I agree. Just looking for some feedback before I attempt a
> patch. Some of the functions I think need updating for hard/soft
> affinity...
>
Ok, first of all, about one global runqueue vs. one runqueue per
socket/L2: I'm also seeing what you are reporting, i.e., all the
cpus being assigned to the same runqueue, despite the box having
two sockets:

cpu_topology           :
cpu:    core    socket    node
  0:       0        0       0
  1:       1        0       0
  2:       2        0       0
  3:       3        0       0
  4:       0        1       1
  5:       1        1       1
  6:       2        1       1
  7:       3        1       1

and despite this piece of code in sched_credit2.c:init_pcpu():

    /* Figure out which runqueue to put it in */
    /* NB: cpu 0 doesn't get a STARTING callback, so we hard-code
     * it to runqueue 0. */
    if ( cpu == 0 )
        rqi = 0;
    else
        rqi = cpu_to_socket(cpu);

    if ( rqi < 0 )
    {
        printk("%s: cpu_to_socket(%d) returned %d!\n",
               __func__, cpu, rqi);
        BUG();
    }

    rqd = prv->rqd + rqi;

    printk("Adding cpu %d to runqueue %d\n", cpu, rqi);
    if ( ! cpumask_test_cpu(rqi, &prv->active_queues) )
    {
        printk(" First cpu on runqueue, activating\n");
        activate_runqueue(prv, rqi);
    }

which, AFAICT, ought to be creating two runqueues. Weird... Let's
raise that in a separate e-mail/thread, ok?

> runq_candidate needs to be updated. It decides which vcpu from the
> run queue to run next on a given pcpu. Currently it only takes
> credit into account. Considering hard affinity should be simple
> enough.
>
Most likely. I guess you're thinking of something like this: instead
of just picking the next vcpu in the queue, scan it until we find one
whose hard affinity allows the cpu we're dealing with. Is that so? If
yes, I guess it would be fine, although at some point we'd want to
figure out the performance implications of such a scan.

Also, I'm not too familiar with credit2, but it's entirely possible
that, even after having modified this, there are other places where
hard affinity needs to be considered (saying this just as a 'warning'
against claiming victory too soon ;-P ).

> For soft, what if it first looked through the run queue in credit
> order at only vcpus that prefer to run on the given processor and
> had a certain amount of credit, and if none were found it then
> considered the whole run queue, taking only hard affinity and
> credit into account?
>
What do you mean by "had a certain amount of credit"? Apart from
that, it seems a reasonable line of thinking... It looks similar to
the two-phase approach I took for credit1. Again, some performance
investigation will be necessary at some point. Of course, we can
first come up with an implementation, and then start benchmarking
and optimizing (actually, optimizing too early usually leads to very
bad situations!).
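Just to fix the ideas, here is a minimal sketch of what such a
two-phase runq_candidate() could look like. Purely illustrative, of
course: the signature mirrors the current credit2 one, but the
cpu_hard_affinity / cpu_soft_affinity masks follow the naming from
my credit1 soft affinity series (so treat them as assumptions here),
and it leaves out scurr handling, credit resets, and your "certain
amount of credit" threshold:

    static struct csched_vcpu *
    runq_candidate(struct csched_runqueue_data *rqd,
                   struct csched_vcpu *scurr, int cpu, s_time_t now)
    {
        struct list_head *iter;
        struct csched_vcpu *fallback = NULL;

        list_for_each ( iter, &rqd->runq )
        {
            struct csched_vcpu *svc = __runq_elem(iter);

            /* Hard affinity is a strict requirement: skip any vcpu
             * that is not allowed to run on this cpu at all. */
            if ( !cpumask_test_cpu(cpu, svc->vcpu->cpu_hard_affinity) )
                continue;

            /* The runqueue is kept in credit order, so the first
             * vcpu that also has this cpu in its soft affinity is
             * the best possible pick: take it and stop scanning. */
            if ( cpumask_test_cpu(cpu, svc->vcpu->cpu_soft_affinity) )
                return svc;

            /* Remember the highest-credit vcpu that can at least
             * legally run here, for the "phase 2" fallback. */
            if ( fallback == NULL )
                fallback = svc;
        }

        return fallback;
    }

Note that this folds your two passes into a single scan, which works
exactly because the queue is credit-ordered: the first soft match is
also the highest-credit one, and the recorded fallback is what phase
2 would have found anyway.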
I wonder whether we could do something clever, knowing that we have
one runqueue per socket. I mean, of course both hard and soft
affinities can span sockets, nodes (or whatever), they can be
sparse, etc. However, it would be fairly common for some sort of
relationship to exist between them and the topology (put in place
either by the user explicitly or by the toolstack automatically), to
the point that it may be worth optimizing a bit for this case.

> runq_assign assumes that the run queue associated with
> vcpu->processor is OK for vcpu to run on. If considering affinity,
> I'm not sure if that can be assumed.
>
Yes, because affinity may have changed in the meantime, right? Which
means, in the case of hard affinity, you'd be violating an explicit
user request. In the case of soft affinity, it wouldn't be equally
bad, but it would be nice to check whether we can transition to a
better system state (i.e., one respecting the soft affinity too).

> I probably need to dig further into schedule.c to see where
> vcpu->processor is being assigned initially. Anyway, with only one
> run queue this doesn't matter for now.
>
Well, sure, but I would actually recommend looking more at
sched_credit.c than at schedule.c. What I mean is, when implementing
soft affinity for credit1, I didn't need to touch anything in
schedule.c (well, certainly nothing related to vcpu->processor).
Similarly, talking about hard affinity, the fact that it is possible
to have a scheduler that implements hard affinity (credit1) and one
that does not (credit2) makes me think that the code in schedule.c
is general enough to support both, and that the whole game is played
inside the actual scheduler source files (i.e., either sched_credit.c
or sched_credit2.c).

> choose_cpu / migrate will need to be updated, but currently migrate
> never gets called because there's only one run queue.
>
choose_cpu, yes, definitely. It's basically the core logic
implementing csched_cpu_pick, which is key in choosing a cpu where
to run a vcpu, and both hard and soft affinity need to be taken into
account when doing that.

About migrate, it surely needs tweaking too. That's another place
where, I think, we can make good use of the information we have
about the link between the runqueues and the topology, perhaps after
having checked whether the affinity (whichever one) is in such a
form that it plays well enough with that.

Thanks for your interest in this work! :-)
Dario

-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
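P.S. About choose_cpu: just to make the idea of exploiting the
runqueue/topology link a bit more concrete, here's a rough sketch of
an affinity-aware runqueue selection. Again, this is only an
illustration: pick_runqueue() is a hypothetical helper (not
something in the tree), prv->active_queues and rqd->active are the
fields from sched_credit2.c, and the affinity masks are, as above,
the ones from the credit1 series. Picking the actual cpu within the
chosen runqueue would then be a second, analogous step:

    static int
    pick_runqueue(struct csched_private *prv, struct vcpu *vc)
    {
        int rqi;

        /* Phase 1: prefer a runqueue whose cpus overlap the soft
         * affinity (and, necessarily, the hard one too). */
        for_each_cpu ( rqi, &prv->active_queues )
        {
            struct csched_runqueue_data *rqd = prv->rqd + rqi;

            if ( cpumask_intersects(&rqd->active,
                                    vc->cpu_hard_affinity) &&
                 cpumask_intersects(&rqd->active,
                                    vc->cpu_soft_affinity) )
                return rqi;
        }

        /* Phase 2: settle for any runqueue that the hard affinity
         * allows. */
        for_each_cpu ( rqi, &prv->active_queues )
            if ( cpumask_intersects(&prv->rqd[rqi].active,
                                    vc->cpu_hard_affinity) )
                return rqi;

        return -1; /* no runqueue can legally host this vcpu */
    }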