Re: [PATCH v3 3/3] xen/sched: fix cpu hotplug

From: Juergen Gross <jgross@suse.com>
To: Andrew Cooper <Andrew.Cooper3@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: George Dunlap <George.Dunlap@citrix.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	Gao Ruifeng <ruifeng.gao@intel.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH v3 3/3] xen/sched: fix cpu hotplug
Date: Thu, 1 Sep 2022 08:11:14 +0200
Message-ID: <94576d45-39c2-a786-2fe2-5effb16caf68@suse.com>
In-Reply-To: <096ed545-f268-ba45-6333-ed51d20fc99c@citrix.com>

On 01.09.22 00:52, Andrew Cooper wrote:
> On 16/08/2022 11:13, Juergen Gross wrote:
>> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()
> 
> Cpu cpu.
> 
>> with interrupts disabled, thus any memory allocation or freeing must
>> be avoided.
>>
>> Since commit 5047cd1d5dea ("xen/common: Use enhanced
>> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
>> via an assertion, which will now fail.
>>
>> Before that commit cpu unplugging in normal configurations was working
>> just by chance as only the cpu performing schedule_cpu_rm() was doing
>> active work. With core scheduling enabled, however, failures could
>> result from memory allocations not being properly propagated to other
>> cpus' TLBs.
> 
> This isn't accurate, is it?  The problem with initiating a TLB flush
> with IRQs disabled is that you can deadlock against a remote CPU which
> is waiting for you to enable IRQs first to take a TLB flush IPI.

As long as only one cpu is trying to allocate or free memory during the
stop_machine_run() action, the deadlock won't happen.

> How does a memory allocation out of the xenheap result in a TLB flush?
> Even with split heaps, you're only potentially allocating into a new
> slot which was unused...

Yeah, you are right. The main problem would occur only when a virtual
address is changed to point at another physical address, which should be
quite unlikely.

I can drop that paragraph, as it doesn't really help.

> 
>> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
>> index 228470ac41..ffb2d6202b 100644
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -3260,6 +3260,17 @@ static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
>>       if ( !data )
>>           goto out;
>>   
>> +    if ( aff_alloc )
>> +    {
>> +        if ( !update_node_aff_alloc(&data->affinity) )
> 
> I spent ages trying to figure out what this was doing, before realising
> the problem is the function name.
> 
> alloc (as with free) is the critical piece of information and needs to
> come first.  The fact we typically pass the result to
> update_node_aff(inity) isn't relevant, and becomes actively wrong here
> when we're nowhere near.
> 
> Patch 1 needs to name these helpers:
> 
> bool alloc_affinity_masks(struct affinity_masks *affinity);
> void free_affinity_masks(struct affinity_masks *affinity);
> 
> and then patches 2 and 3 become far easier to follow.
> 
> Similarly in patch 2, the new helpers need to be
> {alloc,free}_cpu_rm_data() to make sense.  These have nothing to do with
> scheduling.
> 
> Also, you shouldn't introduce the helpers static in patch 2 and then
> turn them non-static in patch 3.  That just adds unnecessary churn to
> the complicated patch.

Okay to all of the above.

> 
>> +        {
>> +            XFREE(data);
>> +            goto out;
>> +        }
>> +    }
>> +    else
>> +        memset(&data->affinity, 0, sizeof(data->affinity));
> 
> I honestly don't think it is worth optimising xzalloc() -> xmalloc()
> for the cognitive complexity of having this logic here.

I don't mind either way. This logic is the result of one of Jan's comments.
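
For reference, the trade-off can be sketched in plain C. This is only an
illustration: calloc() and malloc() stand in for xzalloc() and xmalloc(),
and struct rm_demo and its helpers are hypothetical names, not Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct cpu_rm_data; not Xen's layout. */
struct rm_demo {
    unsigned long affinity[2];
    unsigned int other;
};

/* Simple variant: zero everything, as a zeroing allocator would. */
static struct rm_demo *rm_demo_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct rm_demo));
}

/* Variant from the hunk: plain allocation, then clear only one member. */
static struct rm_demo *rm_demo_alloc_partial(void)
{
    struct rm_demo *p = malloc(sizeof(*p));

    if ( p )
        memset(&p->affinity, 0, sizeof(p->affinity));

    return p;   /* p->other is still indeterminate here */
}

/* Both variants must hand back a cleared affinity member. */
static bool affinity_is_clear(const struct rm_demo *p)
{
    assert(p);
    return !p->affinity[0] && !p->affinity[1];
}
```

Both variants return a cleared affinity member; the second one saves
clearing the rest of the structure at the cost of an extra branch in the
caller.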

> 
>> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
>> index 58e082eb4c..2506861e4f 100644
>> --- a/xen/common/sched/cpupool.c
>> +++ b/xen/common/sched/cpupool.c
>> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d, struct cpupool *c)
>>   }
>>   
>>   /* Update affinities of all domains in a cpupool. */
>> -static void cpupool_update_node_affinity(const struct cpupool *c)
>> +static void cpupool_update_node_affinity(const struct cpupool *c,
>> +                                         struct affinity_masks *masks)
>>   {
>> -    struct affinity_masks masks;
>> +    struct affinity_masks local_masks;
>>       struct domain *d;
>>   
>> -    if ( !update_node_aff_alloc(&masks) )
>> -        return;
>> +    if ( !masks )
>> +    {
>> +        if ( !update_node_aff_alloc(&local_masks) )
>> +            return;
>> +        masks = &local_masks;
>> +    }
>>   
>>       rcu_read_lock(&domlist_read_lock);
>>   
>>       for_each_domain_in_cpupool(d, c)
>> -        domain_update_node_aff(d, &masks);
>> +        domain_update_node_aff(d, masks);
>>   
>>       rcu_read_unlock(&domlist_read_lock);
>>   
>> -    update_node_aff_free(&masks);
>> +    if ( masks == &local_masks )
>> +        update_node_aff_free(masks);
>>   }
>>   
>>   /*
> 
> Why do we need this at all?  domain_update_node_aff() already knows what
> to do when passed NULL, so this seems like an awfully complicated no-op.

You do realize that update_node_aff_free() will do something in case masks
was initially NULL?

> 
>> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>>   {
>>       unsigned int cpu = (unsigned long)hcpu;
>>       int rc = 0;
>> +    static struct cpu_rm_data *mem;
>>   
>>       switch ( action )
>>       {
>>       case CPU_DOWN_FAILED:
>> +        if ( system_state <= SYS_STATE_active )
>> +        {
>> +            if ( mem )
>> +            {
> 
> So, this does compile (and indeed I've tested the result), but I can't
> see how it should.
> 
> mem is guaranteed to be uninitialised at this point, and ...

... it is defined as "static", so it is clearly NULL initially.

> 
>> +                schedule_cpu_rm_free(mem, cpu);
>> +                mem = NULL;
>> +            }
>> +            rc = cpupool_cpu_add(cpu);
>> +        }
>> +        break;
>>       case CPU_ONLINE:
>>           if ( system_state <= SYS_STATE_active )
>>               rc = cpupool_cpu_add(cpu);
>> @@ -1019,12 +1038,31 @@ static int cf_check cpu_callback(
>>       case CPU_DOWN_PREPARE:
>>           /* Suspend/Resume don't change assignments of cpus to cpupools. */
>>           if ( system_state <= SYS_STATE_active )
>> +        {
>>               rc = cpupool_cpu_remove_prologue(cpu);
>> +            if ( !rc )
>> +            {
>> +                ASSERT(!mem);
> 
> ... here, and each subsequent assertion too.
> 
> Given that I tested the patch and it does fix the IRQ assertion, I can
> only imagine that it works by deterministically finding stack rubble
> which happens to be 0.

Not really, as mem isn't on the stack. :-)
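
A small standalone sketch of the language rule, in case it helps: an
object with static storage duration and no initialiser starts out as a
null pointer, even when declared inside a function. The names
(struct rm_data_demo, stash()) are hypothetical and only mimic the shape
of the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpu_rm_data. */
struct rm_data_demo {
    int unused;
};

/* Stores next in a function-local static and returns the previous value. */
static struct rm_data_demo *stash(struct rm_data_demo *next)
{
    static struct rm_data_demo *mem;   /* no initialiser: starts as NULL */
    struct rm_data_demo *prev = mem;

    /* Mirror the patch's ASSERT(!mem): never overwrite a live pointer. */
    assert(!next || !mem);
    mem = next;

    return prev;
}
```

The first call to stash() returns NULL regardless of what happens to be
on the stack, because mem lives in static storage and keeps its value
between calls.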

Juergen
