* [PATCH 0/2] workqueue: fix a bug when numa mapping is changed @ 2015-03-26 2:17 Gu Zheng 2015-03-26 2:17 ` [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent Gu Zheng ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Gu Zheng @ 2015-03-26 2:17 UTC (permalink / raw) To: linux-kernel Cc: tj, laijs, isimatu.yasuaki, kamezawa.hiroyu, tangchen, guz.fnst, izumi.taku Yasuaki Ishimatsu found that node online/offline changes the cpu<->node relationship. Workqueue caches this information, which was established at boot time, so it may become stale after node hotplug. Once pool->node points to a stale node, the following allocation failure happens. == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == As the apicid <--> node relationship is persistent, the root cause is that the cpu-id <-> lapicid mapping is not persistent (the current implementation always chooses the first free cpu id for a newly added cpu). So if we build a persistent cpu-id <-> lapicid relationship, this problem will be fixed. Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion. Gu Zheng (2): x86/cpu hotplug: make lapicid <-> cpuid mapping persistent workqueue: update per cpu workqueue's numa affinity when cpu preparing online arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- kernel/workqueue.c | 1 + 2 files changed, 31 insertions(+), 1 deletions(-) -- 1.7.7 ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 2:17 [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Gu Zheng @ 2015-03-26 2:17 ` Gu Zheng 2015-03-26 3:19 ` Kamezawa Hiroyuki 2015-03-26 2:17 ` [PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online Gu Zheng 2015-03-26 3:12 ` [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Kamezawa Hiroyuki 2 siblings, 1 reply; 23+ messages in thread From: Gu Zheng @ 2015-03-26 2:17 UTC (permalink / raw) To: linux-kernel Cc: tj, laijs, isimatu.yasuaki, kamezawa.hiroyu, tangchen, guz.fnst, izumi.taku Previously, we built the apicid <--> cpuid mapping when the cpu became present, but the relationship changes on cpu/node hotplug, because we always choose the first free cpuid for a hot-added cpu (whether it is newly added or re-added). So the cpuid <--> node mapping changes when node hotplug occurs, and that causes wq sub-system allocation failures: == SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0 node 0: slabs: 6172, objs: 259224, free: 245741 node 1: slabs: 3261, objs: 136962, free: 127656 == So here we build the persistent [lapic id] <--> cpuid mapping when the cpu is first present, and never change it. Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> --- arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- 1 files changed, 30 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ad3639a..d539ebc 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) apic_write(APIC_LVT1, value); } +/* + * Logic cpu number(cpuid) to local APIC id persistent mappings. + * Do not clear the mapping even if cpu hot removed. 
+ * */ +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { + [0 ... MAX_LOCAL_APIC - 1] = -1, +}; + +/* + * Internal cpu id bits, set the bit once cpu present, and never clear it. + * */ +static cpumask_t cpuid_mask = CPU_MASK_NONE; + +static int get_cpuid(int apicid) +{ + int cpuid; + + cpuid = apicid_to_x86_cpu[apicid]; + if (cpuid == -1) + cpuid = cpumask_next_zero(-1, &cpuid_mask); + + return cpuid; +} + int generic_processor_info(int apicid, int version) { int cpu, max = nr_cpu_ids; @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) */ cpu = 0; } else - cpu = cpumask_next_zero(-1, cpu_present_mask); + cpu = get_cpuid(apicid); + + /* Store the mapping */ + apicid_to_x86_cpu[apicid] = cpu; /* * Validate version @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) early_per_cpu(x86_cpu_to_logical_apicid, cpu) = apic->x86_32_early_logical_apicid(cpu); #endif + /* Mark this cpu id as uesed (already mapping a local apic id) */ + cpumask_set_cpu(cpu, &cpuid_mask); set_cpu_possible(cpu, true); set_cpu_present(cpu, true); -- 1.7.7 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 2:17 ` [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent Gu Zheng @ 2015-03-26 3:19 ` Kamezawa Hiroyuki 2015-03-26 4:55 ` Gu Zheng 0 siblings, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-03-26 3:19 UTC (permalink / raw) To: Gu Zheng, linux-kernel; +Cc: tj, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/26 11:17, Gu Zheng wrote: > Previously, we build the apicid <--> cpuid mapping when the cpu is present, but > the relationship will be changed if the cpu/node hotplug happenned, because we > always choose the first free cpuid for the hot added cpu (whether it is new-add > or re-add), so this the cpuid <--> node mapping changed if node hot plug > occurred, and it causes the wq sub-system allocation failture: > == > SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) > cache: kmalloc-192, object size: 192, buffer size: 192, default > order: > 1, min order: 0 > node 0: slabs: 6172, objs: 259224, free: 245741 > node 1: slabs: 3261, objs: 136962, free: 127656 > == > So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first > present, and never change it. > > Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> > --- > arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- > 1 files changed, 30 insertions(+), 1 deletions(-) > > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c > index ad3639a..d539ebc 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) > apic_write(APIC_LVT1, value); > } > > +/* > + * Logic cpu number(cpuid) to local APIC id persistent mappings. > + * Do not clear the mapping even if cpu hot removed. > + * */ > +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { > + [0 ... 
MAX_LOCAL_APIC - 1] = -1, > +}; This patch cannot handle x2apic, which is 32bit. As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least. But you can't create inifinit array. If you can't allocate the array dynamically, How about adding static int cpuid_to_apicid[MAX_CPU] = {} or using idr library ? (please see lib/idr.c) I guess you can update this map after boot(after mm initialization) and make use of idr library. About this patch, Nack. -Kame > + > +/* > + * Internal cpu id bits, set the bit once cpu present, and never clear it. > + * */ > +static cpumask_t cpuid_mask = CPU_MASK_NONE; > + > +static int get_cpuid(int apicid) > +{ > + int cpuid; > + > + cpuid = apicid_to_x86_cpu[apicid]; > + if (cpuid == -1) > + cpuid = cpumask_next_zero(-1, &cpuid_mask); > + > + return cpuid; > +} > + > int generic_processor_info(int apicid, int version) > { > int cpu, max = nr_cpu_ids; > @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) > */ > cpu = 0; > } else > - cpu = cpumask_next_zero(-1, cpu_present_mask); > + cpu = get_cpuid(apicid); > + > + /* Store the mapping */ > + apicid_to_x86_cpu[apicid] = cpu; > > /* > * Validate version > @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) > early_per_cpu(x86_cpu_to_logical_apicid, cpu) = > apic->x86_32_early_logical_apicid(cpu); > #endif > + /* Mark this cpu id as uesed (already mapping a local apic id) */ > + cpumask_set_cpu(cpu, &cpuid_mask); > set_cpu_possible(cpu, true); > set_cpu_present(cpu, true); > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 3:19 ` Kamezawa Hiroyuki @ 2015-03-26 4:55 ` Gu Zheng 2015-03-26 15:13 ` Tejun Heo 2015-03-26 16:31 ` Kamezawa Hiroyuki 0 siblings, 2 replies; 23+ messages in thread From: Gu Zheng @ 2015-03-26 4:55 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-kernel, tj, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame-san, On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: > On 2015/03/26 11:17, Gu Zheng wrote: >> Previously, we build the apicid <--> cpuid mapping when the cpu is present, but >> the relationship will be changed if the cpu/node hotplug happenned, because we >> always choose the first free cpuid for the hot added cpu (whether it is new-add >> or re-add), so this the cpuid <--> node mapping changed if node hot plug >> occurred, and it causes the wq sub-system allocation failture: >> == >> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >> cache: kmalloc-192, object size: 192, buffer size: 192, default >> order: >> 1, min order: 0 >> node 0: slabs: 6172, objs: 259224, free: 245741 >> node 1: slabs: 3261, objs: 136962, free: 127656 >> == >> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first >> present, and never change it. >> >> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> >> --- >> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- >> 1 files changed, 30 insertions(+), 1 deletions(-) >> >> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >> index ad3639a..d539ebc 100644 >> --- a/arch/x86/kernel/apic/apic.c >> +++ b/arch/x86/kernel/apic/apic.c >> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >> apic_write(APIC_LVT1, value); >> } >> >> +/* >> + * Logic cpu number(cpuid) to local APIC id persistent mappings. >> + * Do not clear the mapping even if cpu hot removed. 
>> + * */ >> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >> + [0 ... MAX_LOCAL_APIC - 1] = -1, >> +}; > > > This patch cannot handle x2apic, which is 32bit. IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip generating a logic cpu number for it, so it seems no problem here. > > As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least. > > But you can't create inifinit array. > > If you can't allocate the array dynamically, How about adding > > static int cpuid_to_apicid[MAX_CPU] = {} > > or using idr library ? (please see lib/idr.c) > > I guess you can update this map after boot(after mm initialization) > and make use of idr library. > > About this patch, Nack. > > -Kame > > > >> + >> +/* >> + * Internal cpu id bits, set the bit once cpu present, and never clear it. >> + * */ >> +static cpumask_t cpuid_mask = CPU_MASK_NONE; >> + >> +static int get_cpuid(int apicid) >> +{ >> + int cpuid; >> + >> + cpuid = apicid_to_x86_cpu[apicid]; >> + if (cpuid == -1) >> + cpuid = cpumask_next_zero(-1, &cpuid_mask); >> + >> + return cpuid; >> +} >> + >> int generic_processor_info(int apicid, int version) >> { >> int cpu, max = nr_cpu_ids; >> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) >> */ >> cpu = 0; >> } else >> - cpu = cpumask_next_zero(-1, cpu_present_mask); >> + cpu = get_cpuid(apicid); >> + >> + /* Store the mapping */ >> + apicid_to_x86_cpu[apicid] = cpu; >> >> /* >> * Validate version >> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) >> early_per_cpu(x86_cpu_to_logical_apicid, cpu) = >> apic->x86_32_early_logical_apicid(cpu); >> #endif >> + /* Mark this cpu id as uesed (already mapping a local apic id) */ >> + cpumask_set_cpu(cpu, &cpuid_mask); >> set_cpu_possible(cpu, true); >> set_cpu_present(cpu, true); >> >> > > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 4:55 ` Gu Zheng @ 2015-03-26 15:13 ` Tejun Heo 2015-03-26 16:31 ` Kamezawa Hiroyuki 1 sibling, 0 replies; 23+ messages in thread From: Tejun Heo @ 2015-03-26 15:13 UTC (permalink / raw) To: Gu Zheng Cc: Kamezawa Hiroyuki, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On Thu, Mar 26, 2015 at 12:55:04PM +0800, Gu Zheng wrote: > >> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { > >> + [0 ... MAX_LOCAL_APIC - 1] = -1, > >> +}; > > > > > > This patch cannot handle x2apic, which is 32bit. > > IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip > generating a logic cpu number for it, so it seems no problem here. You're defining 128k array on 64bit most of which will be unused on most machines. There is a problem with that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 4:55 ` Gu Zheng 2015-03-26 15:13 ` Tejun Heo @ 2015-03-26 16:31 ` Kamezawa Hiroyuki 2015-03-30 9:58 ` Gu Zheng 1 sibling, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-03-26 16:31 UTC (permalink / raw) To: Gu Zheng; +Cc: linux-kernel, tj, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/26 13:55, Gu Zheng wrote: > Hi Kame-san, > On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: > >> On 2015/03/26 11:17, Gu Zheng wrote: >>> Previously, we build the apicid <--> cpuid mapping when the cpu is present, but >>> the relationship will be changed if the cpu/node hotplug happenned, because we >>> always choose the first free cpuid for the hot added cpu (whether it is new-add >>> or re-add), so this the cpuid <--> node mapping changed if node hot plug >>> occurred, and it causes the wq sub-system allocation failture: >>> == >>> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >>> cache: kmalloc-192, object size: 192, buffer size: 192, default >>> order: >>> 1, min order: 0 >>> node 0: slabs: 6172, objs: 259224, free: 245741 >>> node 1: slabs: 3261, objs: 136962, free: 127656 >>> == >>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first >>> present, and never change it. >>> >>> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >>> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> >>> --- >>> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- >>> 1 files changed, 30 insertions(+), 1 deletions(-) >>> >>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >>> index ad3639a..d539ebc 100644 >>> --- a/arch/x86/kernel/apic/apic.c >>> +++ b/arch/x86/kernel/apic/apic.c >>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >>> apic_write(APIC_LVT1, value); >>> } >>> >>> +/* >>> + * Logic cpu number(cpuid) to local APIC id persistent mappings. 
>>> + * Do not clear the mapping even if cpu hot removed. >>> + * */ >>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >>> + [0 ... MAX_LOCAL_APIC - 1] = -1, >>> +}; >> >> >> This patch cannot handle x2apic, which is 32bit. > > IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip > generating a logic cpu number for it, so it seems no problem here. > you mean MAX_LOCAL_APIC=32768 ? ....isn't it too wasting ? Anyway, APIC IDs are sparse values. Please use proper structure. Thanks, -Kame >> >> As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least. >> >> But you can't create inifinit array. >> >> If you can't allocate the array dynamically, How about adding >> >> static int cpuid_to_apicid[MAX_CPU] = {} >> >> or using idr library ? (please see lib/idr.c) >> >> I guess you can update this map after boot(after mm initialization) >> and make use of idr library. >> >> About this patch, Nack. >> >> -Kame >> >> >> >>> + >>> +/* >>> + * Internal cpu id bits, set the bit once cpu present, and never clear it. 
>>> + * */ >>> +static cpumask_t cpuid_mask = CPU_MASK_NONE; >>> + >>> +static int get_cpuid(int apicid) >>> +{ >>> + int cpuid; >>> + >>> + cpuid = apicid_to_x86_cpu[apicid]; >>> + if (cpuid == -1) >>> + cpuid = cpumask_next_zero(-1, &cpuid_mask); >>> + >>> + return cpuid; >>> +} >>> + >>> int generic_processor_info(int apicid, int version) >>> { >>> int cpu, max = nr_cpu_ids; >>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) >>> */ >>> cpu = 0; >>> } else >>> - cpu = cpumask_next_zero(-1, cpu_present_mask); >>> + cpu = get_cpuid(apicid); >>> + >>> + /* Store the mapping */ >>> + apicid_to_x86_cpu[apicid] = cpu; >>> >>> /* >>> * Validate version >>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) >>> early_per_cpu(x86_cpu_to_logical_apicid, cpu) = >>> apic->x86_32_early_logical_apicid(cpu); >>> #endif >>> + /* Mark this cpu id as uesed (already mapping a local apic id) */ >>> + cpumask_set_cpu(cpu, &cpuid_mask); >>> set_cpu_possible(cpu, true); >>> set_cpu_present(cpu, true); >>> >>> >> >> >> . >> > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-26 16:31 ` Kamezawa Hiroyuki @ 2015-03-30 9:58 ` Gu Zheng 2015-04-01 2:59 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 23+ messages in thread From: Gu Zheng @ 2015-03-30 9:58 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-kernel, tj, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame-san, On 03/27/2015 12:31 AM, Kamezawa Hiroyuki wrote: > On 2015/03/26 13:55, Gu Zheng wrote: >> Hi Kame-san, >> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: >> >>> On 2015/03/26 11:17, Gu Zheng wrote: >>>> Previously, we build the apicid <--> cpuid mapping when the cpu is present, but >>>> the relationship will be changed if the cpu/node hotplug happenned, because we >>>> always choose the first free cpuid for the hot added cpu (whether it is new-add >>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug >>>> occurred, and it causes the wq sub-system allocation failture: >>>> == >>>> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >>>> cache: kmalloc-192, object size: 192, buffer size: 192, default >>>> order: >>>> 1, min order: 0 >>>> node 0: slabs: 6172, objs: 259224, free: 245741 >>>> node 1: slabs: 3261, objs: 136962, free: 127656 >>>> == >>>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first >>>> present, and never change it. 
>>>> >>>> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >>>> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> >>>> --- >>>> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- >>>> 1 files changed, 30 insertions(+), 1 deletions(-) >>>> >>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >>>> index ad3639a..d539ebc 100644 >>>> --- a/arch/x86/kernel/apic/apic.c >>>> +++ b/arch/x86/kernel/apic/apic.c >>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >>>> apic_write(APIC_LVT1, value); >>>> } >>>> >>>> +/* >>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings. >>>> + * Do not clear the mapping even if cpu hot removed. >>>> + * */ >>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >>>> + [0 ... MAX_LOCAL_APIC - 1] = -1, >>>> +}; >>> >>> >>> This patch cannot handle x2apic, which is 32bit. >> >> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip >> generating a logic cpu number for it, so it seems no problem here. >> > you mean MAX_LOCAL_APIC=32768 ? ....isn't it too wasting ? I use the big array here to keep the same format with the existed ones: int apic_version[MAX_LOCAL_APIC]; s16 __apicid_to_node[MAX_LOCAL_APIC] = { [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE }; Or we should also say "NO" to them? Regards, Gu > > Anyway, APIC IDs are sparse values. Please use proper structure. > > Thanks, > -Kame > >>> >>> As far as I understand, it depends on CPU's spec and the newest cpu has 9bit apicid, at least. >>> >>> But you can't create inifinit array. >>> >>> If you can't allocate the array dynamically, How about adding >>> >>> static int cpuid_to_apicid[MAX_CPU] = {} >>> >>> or using idr library ? (please see lib/idr.c) >>> >>> I guess you can update this map after boot(after mm initialization) >>> and make use of idr library. >>> >>> About this patch, Nack. 
>>> >>> -Kame >>> >>> >>> >>>> + >>>> +/* >>>> + * Internal cpu id bits, set the bit once cpu present, and never clear it. >>>> + * */ >>>> +static cpumask_t cpuid_mask = CPU_MASK_NONE; >>>> + >>>> +static int get_cpuid(int apicid) >>>> +{ >>>> + int cpuid; >>>> + >>>> + cpuid = apicid_to_x86_cpu[apicid]; >>>> + if (cpuid == -1) >>>> + cpuid = cpumask_next_zero(-1, &cpuid_mask); >>>> + >>>> + return cpuid; >>>> +} >>>> + >>>> int generic_processor_info(int apicid, int version) >>>> { >>>> int cpu, max = nr_cpu_ids; >>>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version) >>>> */ >>>> cpu = 0; >>>> } else >>>> - cpu = cpumask_next_zero(-1, cpu_present_mask); >>>> + cpu = get_cpuid(apicid); >>>> + >>>> + /* Store the mapping */ >>>> + apicid_to_x86_cpu[apicid] = cpu; >>>> >>>> /* >>>> * Validate version >>>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version) >>>> early_per_cpu(x86_cpu_to_logical_apicid, cpu) = >>>> apic->x86_32_early_logical_apicid(cpu); >>>> #endif >>>> + /* Mark this cpu id as uesed (already mapping a local apic id) */ >>>> + cpumask_set_cpu(cpu, &cpuid_mask); >>>> set_cpu_possible(cpu, true); >>>> set_cpu_present(cpu, true); >>>> >>>> >>> >>> >>> . >>> >> >> > > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent 2015-03-30 9:58 ` Gu Zheng @ 2015-04-01 2:59 ` Kamezawa Hiroyuki 0 siblings, 0 replies; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-04-01 2:59 UTC (permalink / raw) To: Gu Zheng; +Cc: linux-kernel, tj, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/30 18:58, Gu Zheng wrote: > Hi Kame-san, > > On 03/27/2015 12:31 AM, Kamezawa Hiroyuki wrote: > >> On 2015/03/26 13:55, Gu Zheng wrote: >>> Hi Kame-san, >>> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote: >>> >>>> On 2015/03/26 11:17, Gu Zheng wrote: >>>>> Previously, we build the apicid <--> cpuid mapping when the cpu is present, but >>>>> the relationship will be changed if the cpu/node hotplug happenned, because we >>>>> always choose the first free cpuid for the hot added cpu (whether it is new-add >>>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug >>>>> occurred, and it causes the wq sub-system allocation failture: >>>>> == >>>>> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >>>>> cache: kmalloc-192, object size: 192, buffer size: 192, default >>>>> order: >>>>> 1, min order: 0 >>>>> node 0: slabs: 6172, objs: 259224, free: 245741 >>>>> node 1: slabs: 3261, objs: 136962, free: 127656 >>>>> == >>>>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first >>>>> present, and never change it. 
>>>>> >>>>> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> >>>>> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> >>>>> --- >>>>> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- >>>>> 1 files changed, 30 insertions(+), 1 deletions(-) >>>>> >>>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c >>>>> index ad3639a..d539ebc 100644 >>>>> --- a/arch/x86/kernel/apic/apic.c >>>>> +++ b/arch/x86/kernel/apic/apic.c >>>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup) >>>>> apic_write(APIC_LVT1, value); >>>>> } >>>>> >>>>> +/* >>>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings. >>>>> + * Do not clear the mapping even if cpu hot removed. >>>>> + * */ >>>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = { >>>>> + [0 ... MAX_LOCAL_APIC - 1] = -1, >>>>> +}; >>>> >>>> >>>> This patch cannot handle x2apic, which is 32bit. >>> >>> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip >>> generating a logic cpu number for it, so it seems no problem here. >>> >> you mean MAX_LOCAL_APIC=32768 ? ....isn't it too wasting ? > > I use the big array here to keep the same format with the existed ones: > int apic_version[MAX_LOCAL_APIC]; > > s16 __apicid_to_node[MAX_LOCAL_APIC] = { > [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE > }; > Or we should also say "NO" to them? This will not survive when someone try to enlarge MAX_LOCAL_APIC for introdcing new cpu model because x2apic is 32bit by definition. Please create long surviving infrastructure. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online 2015-03-26 2:17 [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Gu Zheng 2015-03-26 2:17 ` [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent Gu Zheng @ 2015-03-26 2:17 ` Gu Zheng 2015-03-26 3:12 ` [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Kamezawa Hiroyuki 2 siblings, 0 replies; 23+ messages in thread From: Gu Zheng @ 2015-03-26 2:17 UTC (permalink / raw) To: linux-kernel Cc: tj, laijs, isimatu.yasuaki, kamezawa.hiroyu, tangchen, guz.fnst, izumi.taku Update the per cpu workqueue's numa affinity when cpu preparing online to create the worker on the correct node. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> --- kernel/workqueue.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 41ff75b..4c65953 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -4618,6 +4618,7 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb, switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: for_each_cpu_worker_pool(pool, cpu) { + pool->node = cpu_to_node(cpu); if (pool->nr_workers) continue; if (!create_worker(pool)) -- 1.7.7 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 2:17 [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Gu Zheng 2015-03-26 2:17 ` [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent Gu Zheng 2015-03-26 2:17 ` [PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online Gu Zheng @ 2015-03-26 3:12 ` Kamezawa Hiroyuki 2015-03-26 5:04 ` Gu Zheng 2 siblings, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-03-26 3:12 UTC (permalink / raw) To: Gu Zheng, linux-kernel; +Cc: tj, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/26 11:17, Gu Zheng wrote: > Yasuaki Ishimatsu found that with node online/offline, cpu<->node > relationship is established. Because workqueue uses a info which was > established at boot time, but it may be changed by node hotpluging. > > Once pool->node points to a stale node, following allocation failure > happens. > == > SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) > cache: kmalloc-192, object size: 192, buffer size: 192, default > order: > 1, min order: 0 > node 0: slabs: 6172, objs: 259224, free: 245741 > node 1: slabs: 3261, objs: 136962, free: 127656 > == > > As the apicid <--> node relationship is persistent, so the root cause is the ^^^^^^^ pxm. > cpu-id <-> lapicid mapping is not persistent (because the currently implementation > always choose the first free cpu id for the new added cpu), so if we can build > persistent cpu-id <-> lapicid relationship, this problem will be fixed. > > Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion. > > Gu Zheng (2): > x86/cpu hotplug: make lapicid <-> cpuid mapping persistent > workqueue: update per cpu workqueue's numa affinity when cpu > preparing online why patch(2/2) required ? 
Thanks, -Kame > > arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- > kernel/workqueue.c | 1 + > 2 files changed, 31 insertions(+), 1 deletions(-) > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 3:12 ` [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Kamezawa Hiroyuki @ 2015-03-26 5:04 ` Gu Zheng 2015-03-26 15:18 ` Tejun Heo 0 siblings, 1 reply; 23+ messages in thread From: Gu Zheng @ 2015-03-26 5:04 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-kernel, tj, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame-san, On 03/26/2015 11:12 AM, Kamezawa Hiroyuki wrote: > On 2015/03/26 11:17, Gu Zheng wrote: >> Yasuaki Ishimatsu found that with node online/offline, cpu<->node >> relationship is established. Because workqueue uses a info which was >> established at boot time, but it may be changed by node hotpluging. >> >> Once pool->node points to a stale node, following allocation failure >> happens. >> == >> SLUB: Unable to allocate memory on node 2 (gfp=0x80d0) >> cache: kmalloc-192, object size: 192, buffer size: 192, default >> order: >> 1, min order: 0 >> node 0: slabs: 6172, objs: 259224, free: 245741 >> node 1: slabs: 3261, objs: 136962, free: 127656 >> == >> >> As the apicid <--> node relationship is persistent, so the root cause is the > ^^^^^^^ > pxm. > >> cpu-id <-> lapicid mapping is not persistent (because the currently implementation >> always choose the first free cpu id for the new added cpu), so if we can build >> persistent cpu-id <-> lapicid relationship, this problem will be fixed. >> >> Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion. >> >> Gu Zheng (2): >> x86/cpu hotplug: make lapicid <-> cpuid mapping persistent >> workqueue: update per cpu workqueue's numa affinity when cpu >> preparing online > > why patch(2/2) required ? 
wq generates the numa affinity (pool->node) for every possible cpu's per-cpu workqueue at init time, which means the affinity of currently not-present cpus may be incorrect. So we need to update pool->node for a newly added cpu to the correct node when it prepares to come online; otherwise wq may try to create a worker on an invalid node after node hotplug. Regards, Gu > > Thanks, > -Kame > >> >> arch/x86/kernel/apic/apic.c | 31 ++++++++++++++++++++++++++++++- >> kernel/workqueue.c | 1 + >> 2 files changed, 31 insertions(+), 1 deletions(-) >> > > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 5:04 ` Gu Zheng @ 2015-03-26 15:18 ` Tejun Heo 2015-03-26 16:42 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 23+ messages in thread From: Tejun Heo @ 2015-03-26 15:18 UTC (permalink / raw) To: Gu Zheng Cc: Kamezawa Hiroyuki, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku Hello, On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: > wq generates the numa affinity (pool->node) for all the possible cpu's > per cpu workqueue at init stage, that means the affinity of currently un-present > ones' may be incorrect, so we need to update the pool->node for the new added cpu > to the correct node when preparing online, otherwise it will try to create worker > on invalid node if node hotplug occurred. If the mapping is gonna be static once the cpus show up, any chance we can initialize that for all possible cpus during boot? Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 15:18 ` Tejun Heo @ 2015-03-26 16:42 ` Kamezawa Hiroyuki 2015-03-30 9:49 ` Gu Zheng 2015-03-30 9:49 ` Gu Zheng 0 siblings, 2 replies; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-03-26 16:42 UTC (permalink / raw) To: Tejun Heo, Gu Zheng Cc: linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/27 0:18, Tejun Heo wrote: > Hello, > > On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: >> wq generates the numa affinity (pool->node) for all the possible cpu's >> per cpu workqueue at init stage, that means the affinity of currently un-present >> ones' may be incorrect, so we need to update the pool->node for the new added cpu >> to the correct node when preparing online, otherwise it will try to create worker >> on invalid node if node hotplug occurred. > > If the mapping is gonna be static once the cpus show up, any chance we > can initialize that for all possible cpus during boot? > I think the kernel can define all possible cpuid <-> lapicid <-> pxm <-> nodeid mapping at boot with using firmware table information. One concern is current x86 logic for memory-less node v.s. memory hotplug. (as I explained before) My idea is step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node id at boot. But this may be overwritten by x86's memory less node logic. So, step2. check node is online or not before calling kmalloc. If offline, use -1. rather than updating workqueue's attribute. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 16:42 ` Kamezawa Hiroyuki @ 2015-03-30 9:49 ` Gu Zheng 2015-03-30 9:49 ` Gu Zheng 1 sibling, 0 replies; 23+ messages in thread From: Gu Zheng @ 2015-03-30 9:49 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Tejun Heo, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame-san, On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote: > On 2015/03/27 0:18, Tejun Heo wrote: >> Hello, >> >> On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: >>> wq generates the numa affinity (pool->node) for all the possible cpu's >>> per cpu workqueue at init stage, that means the affinity of currently un-present >>> ones' may be incorrect, so we need to update the pool->node for the new added cpu >>> to the correct node when preparing online, otherwise it will try to create worker >>> on invalid node if node hotplug occurred. >> >> If the mapping is gonna be static once the cpus show up, any chance we >> can initialize that for all possible cpus during boot? >> > > I think the kernel can define all possible > > cpuid <-> lapicid <-> pxm <-> nodeid > > mapping at boot with using firmware table information. Could you explain more? > > One concern is current x86 logic for memory-less node v.s. memory hotplug. > (as I explained before) > > My idea is > step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node id at boot. > > But this may be overwritten by x86's memory less node logic. So, > step2. check node is online or not before calling kmalloc. If offline, use -1. > rather than updating workqueue's attribute. > > Thanks, > -Kame > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-26 16:42 ` Kamezawa Hiroyuki 2015-03-30 9:49 ` Gu Zheng @ 2015-03-30 9:49 ` Gu Zheng 2015-03-31 6:09 ` Kamezawa Hiroyuki 1 sibling, 1 reply; 23+ messages in thread From: Gu Zheng @ 2015-03-30 9:49 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Tejun Heo, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame-san, On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote: > On 2015/03/27 0:18, Tejun Heo wrote: >> Hello, >> >> On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: >>> wq generates the numa affinity (pool->node) for all the possible cpu's >>> per cpu workqueue at init stage, that means the affinity of currently un-present >>> ones' may be incorrect, so we need to update the pool->node for the new added cpu >>> to the correct node when preparing online, otherwise it will try to create worker >>> on invalid node if node hotplug occurred. >> >> If the mapping is gonna be static once the cpus show up, any chance we >> can initialize that for all possible cpus during boot? >> > > I think the kernel can define all possible > > cpuid <-> lapicid <-> pxm <-> nodeid > > mapping at boot with using firmware table information. Could you explain more? Regards, Gu > > One concern is current x86 logic for memory-less node v.s. memory hotplug. > (as I explained before) > > My idea is > step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node id at boot. > > But this may be overwritten by x86's memory less node logic. So, > step2. check node is online or not before calling kmalloc. If offline, use -1. > rather than updating workqueue's attribute. > > Thanks, > -Kame > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-30 9:49 ` Gu Zheng @ 2015-03-31 6:09 ` Kamezawa Hiroyuki 2015-03-31 15:28 ` Tejun Heo 0 siblings, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-03-31 6:09 UTC (permalink / raw) To: Gu Zheng Cc: Tejun Heo, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/03/30 18:49, Gu Zheng wrote: > Hi Kame-san, > > On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote: > >> On 2015/03/27 0:18, Tejun Heo wrote: >>> Hello, >>> >>> On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote: >>>> wq generates the numa affinity (pool->node) for all the possible cpu's >>>> per cpu workqueue at init stage, that means the affinity of currently un-present >>>> ones' may be incorrect, so we need to update the pool->node for the new added cpu >>>> to the correct node when preparing online, otherwise it will try to create worker >>>> on invalid node if node hotplug occurred. >>> >>> If the mapping is gonna be static once the cpus show up, any chance we >>> can initialize that for all possible cpus during boot? >>> >> >> I think the kernel can define all possible >> >> cpuid <-> lapicid <-> pxm <-> nodeid >> >> mapping at boot with using firmware table information. > > Could you explain more? You can scan the firmware tables and store all the provided information for possible cpus/pxms somewhere, even while future cpus/mems are not yet onlined, rather than determining them on demand. But this may be considered an API change for most hot-add users. So, for now, I vote for determining ids at online time but recording them, which is a good way. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-31 6:09 ` Kamezawa Hiroyuki @ 2015-03-31 15:28 ` Tejun Heo 2015-04-01 2:55 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 23+ messages in thread From: Tejun Heo @ 2015-03-31 15:28 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Gu Zheng, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku Hello, Kamezawa. On Tue, Mar 31, 2015 at 03:09:05PM +0900, Kamezawa Hiroyuki wrote: > But this may be considered as API change for most hot-add users. Hmm... Why would it be? What can that possibly break? > So, for now, I vote for detemining ids at online but record it is a good way. If we know the information during boot, let's please do it at boot time by all means. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-03-31 15:28 ` Tejun Heo @ 2015-04-01 2:55 ` Kamezawa Hiroyuki 2015-04-01 3:02 ` Tejun Heo 0 siblings, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-04-01 2:55 UTC (permalink / raw) To: Tejun Heo Cc: Gu Zheng, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/04/01 0:28, Tejun Heo wrote: > Hello, Kamezawa. > > On Tue, Mar 31, 2015 at 03:09:05PM +0900, Kamezawa Hiroyuki wrote: >> But this may be considered as API change for most hot-add users. > > Hmm... Why would it be? What can that possibly break? > Now, hot-added cpus will have the lowest free cpu id. Because of this, on most systems which have only cpu hot-add, cpu-ids are always contiguous even after cpu hot add. In enterprise, this would be considered an incompatibility. Determining cpuid <-> lapicid at boot will make cpuids sparse. That may break existing scripts or configuration/resource management software. >> So, for now, I vote for detemining ids at online but record it is a good way. > > If we know the information during boot, let's please do it at boot time by all means. I basically agree. I am just thinking about the influence of this small incompatibility of forcing the ideal way, from my standing point. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-04-01 2:55 ` Kamezawa Hiroyuki @ 2015-04-01 3:02 ` Tejun Heo 2015-04-01 3:05 ` Tejun Heo 2015-04-01 8:30 ` Kamezawa Hiroyuki 0 siblings, 2 replies; 23+ messages in thread From: Tejun Heo @ 2015-04-01 3:02 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Gu Zheng, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: > Now, hot-added cpus will have the lowest free cpu id. > > Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always > contiguous even after cpu hot add. > In enterprise, this would be considered as imcompatibility. > > determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt > exisiting script or configuration/resource management software. Ugh... so, cpu number allocation on hot-add is part of userland interface that we're locked into? Tying hotplug and id allocation order together usually isn't a good idea. What if the cpu up fails while running the notifiers? The ID is already allocated and the next cpu being brought up will be after a hole anyway. Is this even actually gonna affect userland? -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-04-01 3:02 ` Tejun Heo @ 2015-04-01 3:05 ` Tejun Heo 2015-04-01 8:30 ` Kamezawa Hiroyuki 1 sibling, 0 replies; 23+ messages in thread From: Tejun Heo @ 2015-04-01 3:05 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Gu Zheng, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On Tue, Mar 31, 2015 at 11:02:42PM -0400, Tejun Heo wrote: > Ugh... so, cpu number allocation on hot-add is part of userland > interface that we're locked into? Tying hotplug and id allocation > order together usually isn't a good idea. What if the cpu up fails > while running the notifiers? The ID is already allocated and the next > cpu being brought up will be after a hole anyway. Is this even > actually gonna affect userland? At any rate, this isn't big a deal. If the mapping has to be dynamic, it'd be great to move the mapping code from workqueue to cpu hotplug proper tho. Thanks. -- tejun ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-04-01 3:02 ` Tejun Heo @ 2015-04-01 8:30 ` Kamezawa Hiroyuki 2015-04-02 1:36 ` Gu Zheng 1 sibling, 1 reply; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-04-01 8:30 UTC (permalink / raw) To: Tejun Heo Cc: Gu Zheng, linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/04/01 12:02, Tejun Heo wrote: > On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: >> Now, hot-added cpus will have the lowest free cpu id. >> >> Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always >> contiguous even after cpu hot add. >> In enterprise, this would be considered as imcompatibility. >> >> determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt >> exisiting script or configuration/resource management software. > > Ugh... so, cpu number allocation on hot-add is part of userland > interface that we're locked into? We checked most of the RHEL7 packages and didn't find a problem yet. But, for example, we know a performance test team's test program assumed contiguous cpuids and it failed. It was an easy case because we could ask them to fix the application, but I guess there will be some customers who assume cpuids are contiguous. > Tying hotplug and id allocation > order together usually isn't a good idea. What if the cpu up fails > while running the notifiers? The ID is already allocated and the next > cpu being brought up will be after a hole anyway. Is this even > actually gonna affect userland? > Maybe. It's not fail-safe but.... In general, all kernel engineers (and skilled userland engineers) know that cpuids are not always contiguous and that cpuids/nodeids should be checked before running programs. I think most engineers are aware of that, but many users have their own assumptions :( Basically, I don't have strong objections; you're right technically. In summary... 
- users should not assume cpuids are contiguous. - all possible ids should be fixed at boot time. - For users, some clarifying document should be somewhere in Documentation. So, Gu-san 1) determine all possible ids at boot. 2) clarify in Documentation that cpuids/nodeids can have holes because of 1). 3) It would be good if the other guys give us their acks. In the future, I myself think a naming system like udev for cpuid/numaid will be necessary eventually. Could that renaming feature be a cgroup/namespace feature? If possible, each container could have cpuids starting from 0. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-04-01 8:30 ` Kamezawa Hiroyuki @ 2015-04-02 1:36 ` Gu Zheng 2015-04-02 2:54 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 23+ messages in thread From: Gu Zheng @ 2015-04-02 1:36 UTC (permalink / raw) To: Kamezawa Hiroyuki, Tejun Heo Cc: linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku Hi Kame, TJ, On 04/01/2015 04:30 PM, Kamezawa Hiroyuki wrote: > On 2015/04/01 12:02, Tejun Heo wrote: >> On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: >>> Now, hot-added cpus will have the lowest free cpu id. >>> >>> Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always >>> contiguous even after cpu hot add. >>> In enterprise, this would be considered as imcompatibility. >>> >>> determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt >>> exisiting script or configuration/resource management software. >> >> Ugh... so, cpu number allocation on hot-add is part of userland >> interface that we're locked into? > > We checked most of RHEL7 packages and didn't find a problem yet. > But, for examle, we know some performance test team's test program assumed contiguous > cpuids and it failed. It was an easy case because we can ask them to fix the application > but I guess there will be some amount of customers that cpuids are contiguous. > >> Tying hotplug and id allocation >> order together usually isn't a good idea. What if the cpu up fails >> while running the notifiers? The ID is already allocated and the next >> cpu being brought up will be after a hole anyway. Is this even >> actually gonna affect userland? >> > > Maybe. It's not fail-safe but.... > > In general, all kernel engineers (and skilled userland engineers) knows that > cpuids cannot be always contiguous and cpuids/nodeids should be checked before > running programs. 
I think most of engineers should be aware of that but many > users have their own assumption :( > > Basically, I don't have strong objections, you're right technically. > > In summary... > - users should not assume cpuids are contiguous. > - all possible ids should be fixed at boot time. > - For uses, some clarification document should be somewhere in Documenatation. Fine to me. > > So, Gu-san > 1) determine all possible ids at boot. > 2) clarify cpuid/nodeid can have hole because of 1) in Documenation. > 3) It would be good if other guys give us ack. Also fine. But before going this way, could you please reconsider determining the ids when the cpus first become present (the implementation in this patchset)? Though it is not perfect in some respects, we can set aside the doubts mentioned above, since cpu/node hotplug is not a frequent operation, and there seems to be nothing harmful to us in going this way. Regards, Gu > > In future, > I myself thinks naming system like udev for cpuid/numaid is necessary, at last. > Can that renaming feature can be cgroup/namespace feature ? If possible, > all container can have cpuids starting from 0. > > Thanks, > -Kame > > . > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed 2015-04-02 1:36 ` Gu Zheng @ 2015-04-02 2:54 ` Kamezawa Hiroyuki 0 siblings, 0 replies; 23+ messages in thread From: Kamezawa Hiroyuki @ 2015-04-02 2:54 UTC (permalink / raw) To: Gu Zheng, Tejun Heo Cc: linux-kernel, laijs, isimatu.yasuaki, tangchen, izumi.taku On 2015/04/02 10:36, Gu Zheng wrote: > Hi Kame, TJ, > > On 04/01/2015 04:30 PM, Kamezawa Hiroyuki wrote: > >> On 2015/04/01 12:02, Tejun Heo wrote: >>> On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote: >>>> Now, hot-added cpus will have the lowest free cpu id. >>>> >>>> Because of this, in most of systems which has only cpu-hot-add, cpu-ids are always >>>> contiguous even after cpu hot add. >>>> In enterprise, this would be considered as imcompatibility. >>>> >>>> determining cpuid <-> lapicid at boot will make cpuids sparse. That may corrupt >>>> exisiting script or configuration/resource management software. >>> >>> Ugh... so, cpu number allocation on hot-add is part of userland >>> interface that we're locked into? >> >> We checked most of RHEL7 packages and didn't find a problem yet. >> But, for examle, we know some performance test team's test program assumed contiguous >> cpuids and it failed. It was an easy case because we can ask them to fix the application >> but I guess there will be some amount of customers that cpuids are contiguous. >> >>> Tying hotplug and id allocation >>> order together usually isn't a good idea. What if the cpu up fails >>> while running the notifiers? The ID is already allocated and the next >>> cpu being brought up will be after a hole anyway. Is this even >>> actually gonna affect userland? >>> >> >> Maybe. It's not fail-safe but.... >> >> In general, all kernel engineers (and skilled userland engineers) knows that >> cpuids cannot be always contiguous and cpuids/nodeids should be checked before >> running programs. 
I think most of engineers should be aware of that but many >> users have their own assumption :( >> >> Basically, I don't have strong objections, you're right technically. >> >> In summary... >> - users should not assume cpuids are contiguous. >> - all possible ids should be fixed at boot time. >> - For uses, some clarification document should be somewhere in Documenatation. > > Fine to me. > >> >> So, Gu-san >> 1) determine all possible ids at boot. >> 2) clarify cpuid/nodeid can have hole because of 1) in Documenation. >> 3) It would be good if other guys give us ack. > > Also fine. > But before this going, could you please reconsider determining the ids when firstly > present (the implementation on this patchset)? > Though it is not the perfect one in some words, but we can ignore the doubts that > mentioned above as the cpu/node hotplug is not frequent behaviours, and there seems > not anything harmful to us if we go this way. > Is it such heavy work? Hmm. My requests are: as step 1, implement your patches so that - the current behavior at boot is not changed, and - all possible apicids are remembered and given future cpuids if not yet assigned. As step 2, please fix the dynamic pxm<->node detection. In the future, memory-less node handling in x86 should be revisited. Thanks, -Kame ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2015-04-02 2:55 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-03-26 2:17 [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Gu Zheng 2015-03-26 2:17 ` [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent Gu Zheng 2015-03-26 3:19 ` Kamezawa Hiroyuki 2015-03-26 4:55 ` Gu Zheng 2015-03-26 15:13 ` Tejun Heo 2015-03-26 16:31 ` Kamezawa Hiroyuki 2015-03-30 9:58 ` Gu Zheng 2015-04-01 2:59 ` Kamezawa Hiroyuki 2015-03-26 2:17 ` [PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online Gu Zheng 2015-03-26 3:12 ` [PATCH 0/2] workqueue: fix a bug when numa mapping is changed Kamezawa Hiroyuki 2015-03-26 5:04 ` Gu Zheng 2015-03-26 15:18 ` Tejun Heo 2015-03-26 16:42 ` Kamezawa Hiroyuki 2015-03-30 9:49 ` Gu Zheng 2015-03-30 9:49 ` Gu Zheng 2015-03-31 6:09 ` Kamezawa Hiroyuki 2015-03-31 15:28 ` Tejun Heo 2015-04-01 2:55 ` Kamezawa Hiroyuki 2015-04-01 3:02 ` Tejun Heo 2015-04-01 3:05 ` Tejun Heo 2015-04-01 8:30 ` Kamezawa Hiroyuki 2015-04-02 1:36 ` Gu Zheng 2015-04-02 2:54 ` Kamezawa Hiroyuki