From: Dario Faggioli <dario.faggioli@citrix.com>
Cc: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>,
	"jtweaver@hawaii.edu" <jtweaver@hawaii.edu>,
	George Dunlap <George.Dunlap@citrix.com>,
	"henric@hawaii.edu" <henric@hawaii.edu>
Subject: Re: [PATCH v2] sched: credit2: respect per-vcpu hard affinity
Date: Fri, 6 Mar 2015 17:02:20 +0000
Message-ID: <1425661338.16932.52.camel@citrix.com>
In-Reply-To: <54F9C55F.5070608@eu.citrix.com>


On Fri, 2015-03-06 at 15:18 +0000, George Dunlap wrote:
> On 03/03/2015 03:15 AM, Dario Faggioli wrote:
> > On Sun, 2015-02-08 at 17:45 -1000, Justin T. Weaver wrote:

> >>  /*
> >> + * Use this to avoid having too many cpumask_t structs on the stack
> >> + */
> >> +static cpumask_t **cpumask = NULL;
> >>
> > Not just 'cpumask', please... It's too generic a name. Let's pick
> > something that makes its purpose easier to understand.
> > 
> > I'm not really sure right now what something like that could be...
> > Credit has balance_mask, but we're not going as far as introducing
> > multiple step load balancing here (not with this patch, at least).
> > 
> > Maybe affinity_cpumask, or hard_affinity_cpumask (and we'll rename to
> > something else when introducing soft affinity, if needed).
> 
> Well I think it's just being used as scratch space, right?  Why not
> scratch_mask or something like that?
> 
Fine with me.
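
Just to be concrete, the declaration would then become something like
this (a trivial sketch of the agreed rename, nothing more):

    /*
     * Scratch space, to avoid having too many cpumask_t structs on the
     * stack.
     */
    static cpumask_t **scratch_mask = NULL;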

> >> +static int get_safe_pcpu(struct csched2_vcpu *svc)
> >> +{
> >>
> > I also don't like the name... __choose_cpu(), maybe?
> 
> I'm not 100% happy with the name, but I think "get_safe_pcpu" is more
> descriptive than just "__choose_cpu".
>
I don't like the "_safe_" part, but that is not a big deal; I can
certainly live with it.

> >> +    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity, &svc->rqd->active);
> >> +    if ( unlikely(cpumask_empty(csched2_cpumask)) )
> >> +        cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
> >> +            cpupool_online_cpumask(svc->vcpu->domain->cpupool));
> >>
> > VCPU2ONLINE(svc->vcpu) would make the line shorter.
> > 
> > Also, I know I'm the one that suggested this form for the code, but
> > thinking again about it, I'm not sure the first if is worth:
> > 
> >     cpumask_and(csched2_cpumask, &svc->rqd->active, VCPU2ONLINE(svc->vcpu));
> >     cpumask_and(csched2_cpumask, csched2_cpumask, svc->vcpu->cpu_hard_affinity);
> 
> Actually, I was going to say is there any reason not to start with:
> 
> if ( likely(cpumask_test_cpu(svc->vcpu->processor,
>                              svc->vcpu->cpu_hard_affinity)) )
>     return svc->vcpu->processor;
> 
> And then go through doing the unions and what-not once we've determined
> that the current processor isn't suitable?
> 
I like the idea, and I think it actually makes sense. I'm trying to
think of a scenario where this could be bogus, and where we'd actually
need to do the filtering against rqd->active and online up front, but I
can't imagine one.

I think the most likely source of trouble is affinity being changed
but, even then, if we were running on vcpu->processor, and that is still
in our hard affinity mask, it ought to be fine to stay there (and hence
George's suggestion should work)... Justin, what do you think?

> > Oh, and, with either yours or my variant... can csched2_cpumask end up
> > empty after the two &&-s? That's important or, below, cpumask_any will
> > return garbage! :-/
> > 
> > From a quick inspection, it does not seem it can, in which case, I'd put
> > down an ASSERT() about this somewhere. OTOH, if it can, I'd look for
> > ways of preventing that from happening before getting here... It seems easier
> > and more correct to do that, rather than trying to figure out what to do
> > here if the mask is empty. :-O
> 
> Yes, we need to go through the code and make sure that we understand
> what our assumptions are, so that people can't crash Xen by doing
> irrational things like making the hard affinity not intersect with the
> cpupool cpus.
> 
True.

Something like that can be figured out and either forbidden or, more
generally, addressed in other places, rather than requiring us to care
about it here. In fact, this seems to me to be what's happening already
(see below).

> If we make this an "unusual slow path", I don't see any reason not to
> make the code a bit more resilient by handling cases we don't expect
> to happen.
>
Well, again, true, generally speaking. However --perhaps because I'm
not sure I'm getting what "unusual slow path" means in this context-- if
we say that we support hard affinity, then vCPUs being found on pCPUs
outside of their hard affinity is a very bad thing.

For instance, people may rely on this for various reasons, e.g., to
achieve a high level of isolation between domains by partitioning the
hard affinity of their vCPUs accordingly, and this isolation not being
enforced can screw things up arbitrarily badly for them, I think.

Whether this is worth exploding over is probably debatable (and
workload and use case dependent), but it definitely falls in the "I want
to catch this during testing" (so ==> ASSERT) category, IMO.

> It would be good to try to make sure we don't allow a situation
> where union(hard_affinity, domain->cpupool) is empty, but I'd rather the
> result be something not too bizarre (e.g., picking a random cpu in the
> cpupool) than having the host crash or something.
> 
Yeah, I like robustness... but only up to a certain extent, I think, or
you end up having to deal with all sorts of nonsensical situations
pretty much everywhere! :-)

AFAICR, and having another quick look around, I think
intersection(hard_affinity,domain->cpupool) (because with union(), you
really meant intersection(), right? ;-P) can't be empty, and that's The
Right Thing (TM).

This is because, when a domain is moved to a cpupool, its hard and soft
affinities are just reset to "all" (see sched_move_domain() in
schedule.c). Also, when the set of pCPUs of a cpupool is altered, or on
CPU hotplug, if that alteration would make the intersection empty, the
hard affinity is again reset to "all" (see cpu_disable_scheduler()).
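
I.e., the invariant is maintained by logic along these lines (a
from-memory paraphrase of cpu_disable_scheduler(); check schedule.c for
the authoritative version, the names may differ):

    cpumask_and(&online_affinity, v->cpu_hard_affinity, c->cpu_valid);
    if ( cpumask_empty(&online_affinity) &&
         cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
    {
        printk("Breaking affinity for d%dv%d\n",
               v->domain->domain_id, v->vcpu_id);
        cpumask_setall(v->cpu_hard_affinity);
    }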

So, dealing with this here with anything other than ASSERTs would, at a
minimum, look confusing to me. It would make me think that it is indeed
something possible and legitimate, and hence make me question why the
code cited above does what it does, etc.

So, to summarize: provided the analysis above is verified, I wouldn't
BUG_ON, nor would I try to handle it; I'd go for an ASSERT.
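
Concretely, at the end of the fallback path in get_safe_pcpu(), I am
thinking of something like this (just a sketch):

    cpumask_and(csched2_cpumask, svc->vcpu->cpu_hard_affinity,
                VCPU2ONLINE(svc->vcpu));
    /* Can't be empty: schedule.c resets the hard affinity to "all"
     * whenever a cpupool change would make the intersection empty. */
    ASSERT(!cpumask_empty(csched2_cpumask));
    return cpumask_any(csched2_cpumask);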

> >> -    /* FIXME: Pay attention to cpu affinity */
> >> -
> 
> Nice to see those disappear. :-)
> 
:-)

> >> +    /*
> >> +     * Assign new_cpu to vc->processor here to get a call to sched_move_irqs
> >> +     * in schedule.c in case there was a hard affinity change within the same
> >> +     * run queue. vc will not be able to run in certain situations without
> >> +     * this call.
> >> +     */
> >> +    vc->processor = new_cpu;
> >>
> > Oh, and this is what was causing you troubles, in case source and
> > destination runqueue were the same... Help me understand, which call to
> > sched_move_irqs() in schedule.c were we missing? I'd say it is the one
> > in vcpu_migrate(), but that does not seem to care about vc->processor
> > (much rather about new_cpu)... what am I missing?
> > 
> > However, if they are not the same, the call to migrate() will override
> > this right away, won't it? What I mean to say is, if this is required
> > only in case trqd and svc->rqd are equal, can we tweak this part of
> > csched2_vcpu_migrate() as follows?
> > 
> >     if ( trqd != svc->rqd )
> >         migrate(ops, svc, trqd, NOW());
> >     else
> >         vc->processor = new_cpu;
> 
> Looking at it now, it does seem a bit weird to me that, when the
> generic scheduler does vcpu_migrate(), we go through all the hassle of
> calling pick_cpu, and then (if the runqueue happens to change) we end
> up picking *yet another* random cpu within that runqueue.
> 
Not sure I'm getting 100% of what you mean here... Are you complaining
about the fact that, independently of Justin's patch, inside migrate()
we call cpumask_any() instead of using new_cpu from here? If so, that
makes sense, and it should not be too hard to fix...
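
For the record, what I have in mind is something like the following
(hypothetical sketch, with an extra parameter added to migrate(); I
have not actually tried this):

    static void migrate(const struct scheduler *ops,
                        struct csched2_vcpu *svc,
                        struct csched2_runqueue_data *trqd,
                        unsigned int new_cpu, s_time_t now)
    {
        /* ... body unchanged, except that instead of:
         *     svc->vcpu->processor = cpumask_any(&trqd->active);
         * we would honour the pCPU that pick_cpu already chose: */
        svc->vcpu->processor = new_cpu;
        /* ... */
    }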

> But that's a fix for another time, I think.
>
Agreed (again, if I got what you meant :-) ).

Regards,
Dario
