Re: [PATCH v3 3/3] xen/sched: fix cpu hotplug

From: Juergen Gross <jgross@suse.com>
To: Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: George Dunlap <George.Dunlap@citrix.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	Gao Ruifeng <ruifeng.gao@intel.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH v3 3/3] xen/sched: fix cpu hotplug
Date: Thu, 1 Sep 2022 14:08:50 +0200	[thread overview]
Message-ID: <7ca10578-6c35-1abf-c418-8abd16cd47b5@suse.com> (raw)
In-Reply-To: <bb4114b7-0376-677b-5176-a4d4d1ef750d@citrix.com>

[-- Attachment #1.1.1: Type: text/plain, Size: 5840 bytes --]

On 01.09.22 14:01, Andrew Cooper wrote:
> On 01/09/2022 07:11, Juergen Gross wrote:
>> On 01.09.22 00:52, Andrew Cooper wrote:
>>> On 16/08/2022 11:13, Juergen Gross wrote:
>>>> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
>>>
>>> Cpu cpu.
>>>
>>>> with interrupts disabled, thus any memory allocation or freeing must
>>>> be avoided.
>>>>
>>>> Since commit 5047cd1d5dea ("xen/common: Use enhanced
>>>> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
>>>> via an assertion, which will now fail.
>>>>
>>>> Before that commit cpu unplugging in normal configurations was working
>>>> just by chance as only the cpu performing schedule_cpu_rm() was doing
>>>> active work. With core scheduling enabled, however, failures could
>>>> result from memory allocations not being properly propagated to other
>>>> cpus' TLBs.
>>>
>>> This isn't accurate, is it?  The problem with initiating a TLB flush
>>> with IRQs disabled is that you can deadlock against a remote CPU which
>>> is waiting for you to enable IRQs first to take a TLB flush IPI.
>>
>> As long as only one cpu is trying to allocate/free memory during the
>> stop_machine_run() action the deadlock won't happen.
>>
>>> How does a memory allocation out of the xenheap result in a TLB flush?
>>> Even with split heaps, you're only potentially allocating into a new
>>> slot which was unused...
>>
>> Yeah, you are right. The main problem would occur only when a virtual
>> address is changed to point at another physical address, which should be
>> quite unlikely.
>>
>> I can drop that paragraph, as it doesn't really help.
> 
> Yeah, I think that would be best.
> 
>>>
>>>> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
>>>> index 58e082eb4c..2506861e4f 100644
>>>> --- a/xen/common/sched/cpupool.c
>>>> +++ b/xen/common/sched/cpupool.c
>>>> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d,
>>>> struct cpupool *c)
>>>>    }
>>>>      /* Update affinities of all domains in a cpupool. */
>>>> -static void cpupool_update_node_affinity(const struct cpupool *c)
>>>> +static void cpupool_update_node_affinity(const struct cpupool *c,
>>>> +                                         struct affinity_masks *masks)
>>>>    {
>>>> -    struct affinity_masks masks;
>>>> +    struct affinity_masks local_masks;
>>>>        struct domain *d;
>>>>    -    if ( !update_node_aff_alloc(&masks) )
>>>> -        return;
>>>> +    if ( !masks )
>>>> +    {
>>>> +        if ( !update_node_aff_alloc(&local_masks) )
>>>> +            return;
>>>> +        masks = &local_masks;
>>>> +    }
>>>>          rcu_read_lock(&domlist_read_lock);
>>>>          for_each_domain_in_cpupool(d, c)
>>>> -        domain_update_node_aff(d, &masks);
>>>> +        domain_update_node_aff(d, masks);
>>>>          rcu_read_unlock(&domlist_read_lock);
>>>>    -    update_node_aff_free(&masks);
>>>> +    if ( masks == &local_masks )
>>>> +        update_node_aff_free(masks);
>>>>    }
>>>>      /*
>>>
>>> Why do we need this at all?  domain_update_node_aff() already knows what
>>> to do when passed NULL, so this seems like an awfully complicated no-op.
>>
>> You do realize that update_node_aff_free() will do something in case
>> masks
>> was initially NULL?
> 
> By "this", I meant the entire hunk, not just the final if().
> 
> What is wrong with passing the (possibly NULL) masks pointer straight
> down into domain_update_node_aff() and removing all the memory
> allocation in this function?
> 
> But I've also answered that by looking more clearly.  It's about not
> repeating the memory allocations/freeing for each domain in the pool.

Correct.

> Which does make sense as this is a slow path already, but the complexity
> here of having conditionally allocated masks is far from simple.
> 
>>
>>>
>>>> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>>>>    {
>>>>        unsigned int cpu = (unsigned long)hcpu;
>>>>        int rc = 0;
>>>> +    static struct cpu_rm_data *mem;
>>>>          switch ( action )
>>>>        {
>>>>        case CPU_DOWN_FAILED:
>>>> +        if ( system_state <= SYS_STATE_active )
>>>> +        {
>>>> +            if ( mem )
>>>> +            {
>>>
>>> So, this does compile (and indeed I've tested the result), but I can't
>>> see how it should.
>>>
>>> mem is guaranteed to be uninitialised at this point, and ...
>>
>> ... it is defined as "static", so it is clearly NULL initially.
> 
> Oh, so it is.  That is hiding particularly well in plain sight.
> 
> Can it please be this:
> 
> @@ -1014,9 +1014,10 @@ void cf_check dump_runq(unsigned char key)
>   static int cf_check cpu_callback(
>       struct notifier_block *nfb, unsigned long action, void *hcpu)
>   {
> +    static struct cpu_rm_data *mem; /* Protected by cpu_add_remove_lock */
> +
>       unsigned int cpu = (unsigned long)hcpu;
>       int rc = 0;
> -    static struct cpu_rm_data *mem;
>   
>       switch ( action )
>       {
> 
> We already split static and non-static variable like this elsewhere for
> clarity, and identifying the lock which protects the data is useful for
> anyone coming to this in the future.

Fine with me.

> 
> ~Andrew
> 
> P.S. as an observation, this isn't safe for parallel AP booting, but I
> guarantee that this isn't the only example which would want fixing if we
> wanted to get parallel booting working.

You are aware that mem is used only when removing cpus?

Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]