From: Juergen Gross <jgross@suse.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>, Wei Liu <wl@xen.org>,
	Konrad Wilk <konrad.wilk@oracle.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Tim Deegan <tim@xen.org>, Dario Faggioli <dfaggioli@suse.com>,
	Julien Grall <julien.grall@arm.com>,
	Meng Xu <mengxu@cis.upenn.edu>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Ian Jackson <ian.jackson@citrix.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH 13/60] xen/sched: move some per-vcpu items to struct sched_unit
Date: Tue, 2 Jul 2019 08:30:11 +0200
Message-ID: <0bb81eb9-8303-2dae-2fba-28c73ac74050@suse.com>
In-Reply-To: <de741925-b823-92ee-c9be-c4cc55da859d@suse.com>

On 01.07.19 17:46, Jan Beulich wrote:
> On 01.07.2019 17:10, Juergen Gross wrote:
>> On 01.07.19 16:08, Jan Beulich wrote:
>>>>>> On 28.05.19 at 12:32, <jgross@suse.com> wrote:
>>>> @@ -155,8 +156,8 @@ static void nmi_mce_softirq(void)
>>>>         * Set the tmp value unconditionally, so that the check in the iret
>>>>         * hypercall works.
>>>>         */
>>>> -    cpumask_copy(st->vcpu->cpu_hard_affinity_tmp,
>>>> -                 st->vcpu->cpu_hard_affinity);
>>>> +    cpumask_copy(st->vcpu->sched_unit->cpu_hard_affinity_tmp,
>>>> +                 st->vcpu->sched_unit->cpu_hard_affinity);
>>>
>>> Aiui this affects all vCPU-s in the unit, which is unlikely to be what we
>>> want here: There's now only one cpu_hard_affinity_tmp for all vCPU-s
>>> in the unit, yet every vCPU in there may want to make use of the
>>> field in parallel.
>>
>> Hmm, yes, we'll need a usage bitmask.
>>
>> Please note that affecting all vcpus in the unit is by design. With
>> multiple vcpus of a unit needing this feature in parallel there is no
>> way they can have different needs regarding temporary affinity.
> 
> But how will this work? I.e. how will all vCPU-s in a unit get
> their temporary affinity pointing to the one specific pCPU in question?

The _unit_ is pinned, so all the vcpus in that unit are pinned, too.
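
A minimal sketch of such a usage bitmask (the tmp_affinity_users field and
both helper names are made up for illustration, and locking is omitted):

static void unit_tmp_affinity_get(struct vcpu *v)
{
    struct sched_unit *unit = v->sched_unit;

    /* The first user saves the unit's current hard affinity. */
    if ( !unit->tmp_affinity_users )
        cpumask_copy(unit->cpu_hard_affinity_tmp, unit->cpu_hard_affinity);

    __set_bit(v->vcpu_id, &unit->tmp_affinity_users);
}

static void unit_tmp_affinity_put(struct vcpu *v)
{
    struct sched_unit *unit = v->sched_unit;

    __clear_bit(v->vcpu_id, &unit->tmp_affinity_users);

    /* Only the last vCPU done with the temporary affinity drops it again. */
    if ( !unit->tmp_affinity_users )
        cpumask_clear(unit->cpu_hard_affinity_tmp);
}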

> It's not just the "all at the same time" that I don't see working here,
> I'm also having trouble seeing how the potential cross-core or cross-
> node movement that's apparently needed here would end up working. I'm

The unit is moved to another core via the normal scheduling mechanisms. As
context switching is synchronized (see patch 35), all vcpus of a unit are
moved together.

> not going to exclude that the set of possible pCPU-s a vCPU needs to
> move to here is still within the unit, but then I'd expect assertions
> to that effect to be added.

Okay, will do.
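
A sketch of what such an assertion could look like in nmi_mce_softirq()
(illustrative only; it assumes the target pCPU is available as
st->processor there, and the exact predicate is still to be settled):

    ASSERT(cpumask_test_cpu(st->processor,
                            st->vcpu->sched_unit->cpu_hard_affinity));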

> 
>>>> --- a/xen/common/domain.c
>>>> +++ b/xen/common/domain.c
>>>> @@ -125,11 +125,6 @@ static void vcpu_info_reset(struct vcpu *v)
>>>>    static void vcpu_destroy(struct vcpu *v)
>>>>    {
>>>> -    free_cpumask_var(v->cpu_hard_affinity);
>>>> -    free_cpumask_var(v->cpu_hard_affinity_tmp);
>>>> -    free_cpumask_var(v->cpu_hard_affinity_saved);
>>>> -    free_cpumask_var(v->cpu_soft_affinity);
>>>> -
>>>>        free_vcpu_struct(v);
>>>>    }
>>>> @@ -153,12 +148,6 @@ struct vcpu *vcpu_create(
>>>>        grant_table_init_vcpu(v);
>>>> -    if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) ||
>>>> -         !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) ||
>>>> -         !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) ||
>>>> -         !zalloc_cpumask_var(&v->cpu_soft_affinity) )
>>>> -        goto fail;
>>>
>>> Seeing these, I'm actually having trouble understanding how you mean
>>> to retain the user visible interface behavior here: If you only store an
>>> affinity per sched unit, then how are you meaning to honor the vCPU
>>> granular requests coming in?
>>
>> With core scheduling it is only possible to set (virtual) core
>> affinities. Whenever the affinity of a vcpu is set, the affinity of
>> the whole unit is set as well.
> 
> Hmm, that's indeed what I was deducing, but how will we sell this
> to people actually fiddling with vCPU affinities? I foresee getting
> bug reports that the respective xl command(s) no longer do(es) what
> it used to do.

The new behavior must be documented, sure.
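
As an illustration of the resulting behavior, a vCPU-level affinity call
would simply be applied to the whole unit, along the lines of the sketch
below (unit_set_hard_affinity() is a made-up helper name):

int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity)
{
    /* All vCPUs of the unit share one hard affinity mask. */
    return unit_set_hard_affinity(v->sched_unit, affinity);
}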

> 
>>>> --- a/xen/include/xen/sched-if.h
>>>> +++ b/xen/include/xen/sched-if.h
>>>> @@ -438,11 +438,11 @@ static inline cpumask_t* cpupool_domain_cpumask(struct domain *d)
>>>>     * * The hard affinity is not a subset of soft affinity
>>>>     * * There is an overlap between the soft and hard affinity masks
>>>>     */
>>>> -static inline int has_soft_affinity(const struct vcpu *v)
>>>> +static inline int has_soft_affinity(const struct sched_unit *unit)
>>>>    {
>>>> -    return v->soft_aff_effective &&
>>>> -           !cpumask_subset(cpupool_domain_cpumask(v->domain),
>>>> -                           v->cpu_soft_affinity);
>>>> +    return unit->soft_aff_effective &&
>>>> +           !cpumask_subset(cpupool_domain_cpumask(unit->vcpu->domain),
>>>> +                           unit->cpu_soft_affinity);
>>>>    }
>>>
>>> Okay, at the moment there looks to be a 1:1 relationship between sched
>>> units and vCPU-s. This would - at this point of the series - invalidate most
>>> of my earlier comments. However, in patch 57 I don't see how this unit->vcpu
>>> mapping would get broken, and I can't seem to identify any other patch
>>> where this might be happening. Looking at the github branch I also get the
>>> impression that the struct vcpu * pointer out of struct sched_unit survives
>>> until the end of the series, which doesn't seem right to me.
>>
>> It is right. At the end of the series the vcpu pointer in the
>> sched_unit points to the first vcpu of the unit. Further vcpus are
>> found via v->next_in_list.
> 
> I'm afraid this sets us up for misunderstanding and misuse. I don't
> think there should be a straight struct vcpu * out of struct sched_unit.

That was the most effective way to do it. What are you suggesting?
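
For reference, with that layout all vCPUs of a unit can be walked starting
from the first one, roughly like this (sketch only, assuming the vCPUs of
one unit are adjacent in the domain's vcpu list):

#define for_each_sched_unit_vcpu(u, v)                       \
    for ( (v) = (u)->vcpu;                                   \
          (v) != NULL && (v)->sched_unit == (u);             \
          (v) = (v)->next_in_list )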

> 
>>>> @@ -980,7 +978,7 @@ static inline bool is_hvm_vcpu(const struct vcpu *v)
>>>>    static inline bool is_hwdom_pinned_vcpu(const struct vcpu *v)
>>>>    {
>>>>        return (is_hardware_domain(v->domain) &&
>>>> -            cpumask_weight(v->cpu_hard_affinity) == 1);
>>>> +            cpumask_weight(v->sched_unit->cpu_hard_affinity) == 1);
>>>>    }
>>>
>>> Seeing this - how is pinning (by command line option or by Dom0
>>> doing this on itself transiently) going to work with core scheduling?
>>
>> In the end only the bit of the first vcpu of a unit will be set in the
>> affinity masks, affecting all vcpus of the unit.
> 
> I'm confused - what "bit of the first vcpu of a unit" are you referring
> to?

Sorry, I meant to rewrite this sentence and forgot before sending.

But I think the explanation below makes it rather clear.

> To give an example of what I meant with my earlier reply: How is Dom0
> requesting its vCPU 5 to be pinned to pCPU 3 going to be satisfied,
> independent of the sched unit that vCPU 5 is associated with? Is the
> whole sched unit getting moved over then? If so, what if another vCPU
> in the same sched unit at the same time requests to be pinned to pCPU
> 17, on a different node/socket?

Yes, the whole sched unit will be moved.

And in case a unit is already pinned and a conflicting request is made,
the new request will either override the old pinning (in case that was
a "normal" pinning via the xl command) or be rejected (in case the old
pinning was done via SCHEDOP_pin_override).
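
Expressed as code, that conflict handling would look roughly like this
(sketch with made-up field/function names):

static int unit_pin_request(struct sched_unit *unit, const cpumask_t *mask,
                            bool via_override)
{
    /* A pinning installed via SCHEDOP_pin_override cannot be replaced. */
    if ( unit->pinned_by_override )
        return -EBUSY;

    /* The old pinning was a "normal" one: the new request replaces it. */
    cpumask_copy(unit->cpu_hard_affinity, mask);
    unit->pinned_by_override = via_override;

    return 0;
}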


Juergen
