* [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Heiko Carstens @ 2016-07-27 12:54 UTC
To: Peter Zijlstra
Cc: linux-kernel, Thomas Gleixner

Hi Peter,

I get the following warning on s390 when using fake NUMA, beginning with
your patch e9d867a67fd0 ("sched: Allow per-cpu kernel threads to run on
online && !active"):

[    3.162909] WARNING: CPU: 0 PID: 1 at include/linux/cpumask.h:121 select_task_rq+0xe6/0x1a8
[    3.162911] Modules linked in:
[    3.162914] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00001-ge9d867a67fd0-dirty #28
[    3.162917] task: 00000001dd270008 ti: 00000001eccb4000 task.ti: 00000001eccb4000
[    3.162918] Krnl PSW : 0404c00180000000 0000000000176c56 (select_task_rq+0xe6/0x1a8)
[    3.162923]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
               Krnl GPRS: 00000000009f3d00 00000001dd270008 0000000000000100 00000000000000f4
[    3.162927]            0000000000eaf4e2 0000000000000000 00000001eccb7bc0 0400000000000001
[    3.162929]            00000001ec660950 0000000000000000 00000001ec660520 00000001ec660008
[    3.162930]            0000000000000100 000000000099c1a0 0000000000176c30 00000001eccb7a70
[    3.162940] Krnl Code: 0000000000176c4a: a774002f   brc  7,176ca8
                          0000000000176c4e: 92014000   mvi  0(%r4),1
                         #0000000000176c52: a7f40001   brc  15,176c54
                         >0000000000176c56: a7f40029   brc  15,176ca8
                          0000000000176c5a: 95004000   cli  0(%r4),0
                          0000000000176c5e: a7740006   brc  7,176c6a
                          0000000000176c62: 92014000   mvi  0(%r4),1
                          0000000000176c66: a7f40001   brc  15,176c68
[    3.162958] Call Trace:
[    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
[    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
[    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
[    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
[    3.162973] ([<0000000000162550>] apply_wqattrs_prepare+0x200/0x2a0)
[    3.162975] ([<000000000016266a>] apply_workqueue_attrs_locked+0x7a/0xb0)
[    3.162977] ([<0000000000162af0>] apply_workqueue_attrs+0x50/0x78)
[    3.162979] ([<000000000016441c>] __alloc_workqueue_key+0x304/0x520)
[    3.162983] ([<0000000000ee3706>] default_bdi_init+0x3e/0x70)
[    3.162986] ([<0000000000100270>] do_one_initcall+0x140/0x1d8)
[    3.162990] ([<0000000000ec9da8>] kernel_init_freeable+0x220/0x2d8)
[    3.162993] ([<0000000000984a7a>] kernel_init+0x2a/0x150)
[    3.162996] ([<00000000009913fa>] kernel_thread_starter+0x6/0xc)
[    3.162998] ([<00000000009913f4>] kernel_thread_starter+0x0/0xc)
[    3.163000] 4 locks held by swapper/0/1:
[    3.163002]  #0:  (cpu_hotplug.lock){++++++}, at: [<000000000013ebe0>] get_online_cpus+0x48/0xb8
[    3.163010]  #1:  (wq_pool_mutex){+.+.+.}, at: [<0000000000162ae2>] apply_workqueue_attrs+0x42/0x78
[    3.163016]  #2:  (&pool->lock/1){......}, at: [<000000000015d44a>] create_worker+0x152/0x1c0
[    3.163022]  #3:  (&p->pi_lock){..-...}, at: [<0000000000177ac4>] try_to_wake_up+0x44/0x478
[    3.163028] Last Breaking-Event-Address:
[    3.163030]  [<0000000000176c52>] select_task_rq+0xe2/0x1a8

For some unknown reason select_task_rq() gets called with a task that has
nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
within select_task_rq() will set cpu to nr_cpu_ids, which in turn causes
the warning later on.

It only happens with more than one node; otherwise it seems to work fine.

Any idea what could be wrong here?
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Thomas Gleixner @ 2016-07-27 15:23 UTC
To: Heiko Carstens
Cc: Peter Zijlstra, LKML, Tejun Heo

On Wed, 27 Jul 2016, Heiko Carstens wrote:
> [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)

> For some unknown reason select_task_rq() gets called with a task that has
> nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> warning later on.
>
> It only happens with more than one node, otherwise it seems to work fine.
>
> Any idea what could be wrong here?

create_worker()
  tsk = kthread_create_on_node();
  kthread_bind_mask(tsk, pool->attrs->cpumask);
    do_set_cpus_allowed(tsk, mask);
      set_cpus_allowed_common(tsk, mask);
        cpumask_copy(&tsk->cpus_allowed, mask);
        tsk->nr_cpus_allowed = cpumask_weight(mask);
  wake_up_process(task);

So this looks like pool->attrs->cpumask is simply empty.....

Thanks,

	tglx
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Heiko Carstens @ 2016-07-30 11:25 UTC
To: Thomas Gleixner
Cc: Peter Zijlstra, LKML, Tejun Heo

On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
>
> > For some unknown reason select_task_rq() gets called with a task that has
> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> > warning later on.
> >
> > It only happens with more than one node, otherwise it seems to work fine.
> >
> > Any idea what could be wrong here?
>
> create_worker()
>   tsk = kthread_create_on_node();
>   kthread_bind_mask(tsk, pool->attrs->cpumask);
>     do_set_cpus_allowed(tsk, mask);
>       set_cpus_allowed_common(tsk, mask);
>         cpumask_copy(&tsk->cpus_allowed, mask);
>         tsk->nr_cpus_allowed = cpumask_weight(mask);
>   wake_up_process(task);
>
> So this looks like pool->attrs->cpumask is simply empty.....

Just had some time to look into this a bit more. It looks like we
initialize the cpu_to_node masks (way) too late on s390 for fake NUMA,
so Peter's patch just revealed that problem.

I'll see if initializing the masks earlier will fix this, but I think it
will.
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Ming Lei @ 2016-08-08  7:45 UTC
To: Heiko Carstens
Cc: Thomas Gleixner, Peter Zijlstra, LKML, Tejun Heo

On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
> On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
>> On Wed, 27 Jul 2016, Heiko Carstens wrote:
>> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
>> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
>> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
>> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
>>
>> > For some unknown reason select_task_rq() gets called with a task that has
>> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
>> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
>> > warning later on.
>> >
>> > It only happens with more than one node, otherwise it seems to work fine.
>> >
>> > Any idea what could be wrong here?
>>
>> create_worker()
>>   tsk = kthread_create_on_node();
>>   kthread_bind_mask(tsk, pool->attrs->cpumask);
>>     do_set_cpus_allowed(tsk, mask);
>>       set_cpus_allowed_common(tsk, mask);
>>         cpumask_copy(&tsk->cpus_allowed, mask);
>>         tsk->nr_cpus_allowed = cpumask_weight(mask);
>>   wake_up_process(task);
>>
>> So this looks like pool->attrs->cpumask is simply empty.....
>
> Just had some time to look into this a bit more. Looks like we initialize
> the cpu_to_node masks (way) too late on s390 for fake NUMA. So Peter's
> patch just revealed that problem.
>
> I'll see if initializing the masks earlier will fix this, but I think it
> will.

Hello,

Is there any fix for this issue? I can see the issue on arm64 running a
v4.7 kernel too. And the oops can be avoided by reverting commit
e9d867a ("sched: Allow per-cpu kernel threads to run on online && !active").

Thanks,
Ming Lei
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Heiko Carstens @ 2016-08-15 11:19 UTC
To: Ming Lei, Tejun Heo
Cc: Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

On Mon, Aug 08, 2016 at 03:45:05PM +0800, Ming Lei wrote:
> On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens
> <heiko.carstens@de.ibm.com> wrote:
> > On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> >> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> >> > For some unknown reason select_task_rq() gets called with a task that has
> >> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> >> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> >> > warning later on.
> >> >
> >> > It only happens with more than one node, otherwise it seems to work fine.
> >> >
> >> > Any idea what could be wrong here?
> >>
> >> So this looks like pool->attrs->cpumask is simply empty.....
> >
> > Just had some time to look into this a bit more. Looks like we initialize
> > the cpu_to_node masks (way) too late on s390 for fake NUMA. So Peter's
> > patch just revealed that problem.
> >
> > I'll see if initializing the masks earlier will fix this, but I think it
> > will.
>
> Hello,
>
> Is there any fix for this issue? I can see the issue on arm64 running
> v4.7 kernel too. And the oops can be avoided by reverting commit
> e9d867a (sched: Allow per-cpu kernel threads to run on online && !active).

I don't know about the arm64 issue.

The s390 problem is a result of initializing the cpu_to_node mapping too
late.

However, the workqueue code seems to assume that we know the cpu_to_node
mapping for all _possible_ cpus very early, and apparently it assumes that
this mapping is stable and doesn't change anymore. This assumption however
contradicts the purpose of 346404682434 ("numa, cpu hotplug: change links
of CPU and node when changing node number by onlining CPU").

So something is wrong here...

On s390 with fake NUMA we wouldn't even know the mapping of all _possible_
cpus at boot time. When establishing the node mapping we try hard to map
our existing cpu topology into a sane node mapping. However, we simply
don't know where non-present cpus are located topology-wise. Even for
present cpus the answer is not always there, since present cpus can be in
either the state "configured" (topology location known - cpu online
possible) or "deconfigured" (topology location unknown - cpu online not
possible).

I can imagine several ways to fix this for s390, but before doing that I'm
wondering if the workqueue code is correct with

a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
   that early

and

b) that the cpu_to_node() mapping does never change

Tejun?
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Tejun Heo @ 2016-08-15 22:48 UTC
To: Heiko Carstens
Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

Hello, Heiko.

On Mon, Aug 15, 2016 at 01:19:08PM +0200, Heiko Carstens wrote:
> I can imagine several ways to fix this for s390, but before doing that I'm
> wondering if the workqueue code is correct with
>
> a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
>    that early

This can be debatable, and making it "first registration sticks" is
likely easy enough.

> and
>
> b) that the cpu_to_node() mapping does never change

However, this part isn't just from workqueue. It just hits in a more
obvious way. For example, memory allocation has the same problem and
we would have to synchronize memory allocations against the cpu <-> node
mapping changing. It'd be silly to add the complexity and overhead of
making the mapping dynamic when there's nothing inherently dynamic
about it. The surface area is pretty big here.

I have no idea how s390 fakenuma works. Is it very different from
x86's? IIRC, x86's fakenuma isn't all that dynamic.

Thanks.

--
tejun
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Heiko Carstens @ 2016-08-16  7:55 UTC
To: Tejun Heo
Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

Hi Tejun,

> On Mon, Aug 15, 2016 at 01:19:08PM +0200, Heiko Carstens wrote:
> > I can imagine several ways to fix this for s390, but before doing that I'm
> > wondering if the workqueue code is correct with
> >
> > a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
> >    that early
>
> This can be debatable, and making it "first registration sticks" is
> likely easy enough.
>
> > and
> >
> > b) that the cpu_to_node() mapping does never change
>
> However, this part isn't just from workqueue. It just hits in a more
> obvious way. For example, memory allocation has the same problem and
> we would have to synchronize memory allocations against the cpu <-> node
> mapping changing. It'd be silly to add the complexity and overhead of
> making the mapping dynamic when there's nothing inherently dynamic
> about it. The surface area is pretty big here.
>
> I have no idea how s390 fakenuma works. Is it very different from
> x86's? IIRC, x86's fakenuma isn't all that dynamic.

I'm not asking to make the cpu <-> node mapping completely dynamic. We
already have code in place to keep the cpu <-> node mapping static;
currently this happens too late, but that can be fixed quite easily.

Unfortunately we do not always know to which node a cpu belongs when we
register it: currently all cpus will be registered to node 0, and only
when a cpu is brought online will this be corrected. The problem we have
are "standby" cpus on s390, for which we know they are present but can't
use them currently.

The mechanism is the following:

We detect a standby cpu and register it via register_cpu(); since the node
isn't known yet for this cpu, the cpu_to_node() function will return 0,
therefore all standby cpus will be registered under node 0.

The new standby cpu will have a "configure" sysfs attribute. If somebody
writes "1" to it, we signal the hypervisor that we want to use the cpu and
it allocates one. If this request succeeds we finally know where the cpu
is located topology-wise and can fix up everything (and can also make the
cpu to node mapping static).
Note: as long as a cpu isn't configured it cannot be brought online.

If the cpu now is finally brought online, the change_cpu_under_node() code
within drivers/base/cpu.c fixes up the node symlinks so at least the sysfs
representation is also correct.

If later on the cpu is brought offline, deconfigured, etc. we do not
change the cpu_to_node mapping anymore.

So the question is how to define "first registration sticks". :)
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Tejun Heo @ 2016-08-16 15:20 UTC
To: Heiko Carstens
Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

Hello, Heiko.

On Tue, Aug 16, 2016 at 09:55:05AM +0200, Heiko Carstens wrote:
...
> The new standby cpu will have a "configure" sysfs attribute. If somebody
> writes "1" to it, we signal the hypervisor that we want to use the cpu and
> it allocates one. If this request succeeds we finally know where the cpu
> is located topology-wise and can fix up everything (and can also make the
> cpu to node mapping static).
> Note: as long as a cpu isn't configured it cannot be brought online.

I see, so the binding is actually dynamic, at least for the first bring-up.

> If the cpu now is finally brought online, the change_cpu_under_node() code
> within drivers/base/cpu.c fixes up the node symlinks so at least the sysfs
> representation is also correct.
>
> If later on the cpu is brought offline, deconfigured, etc. we do not
> change the cpu_to_node mapping anymore.
>
> So the question is how to define "first registration sticks". :)

As long as the mapping doesn't change after the first onlining of the
CPU, the workqueue side shouldn't be too difficult to fix up. I'll
look into it. For memory allocations, as long as the cpu <-> node
mapping is established before any memory allocation for the cpu takes
place, it should be fine too, I think.

Anyway, give me some days. I'll look into updating workqueue so that
it works with the mapping on the first onlining.

Thanks.

--
tejun
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Peter Zijlstra @ 2016-08-16 15:29 UTC
To: Tejun Heo
Cc: Heiko Carstens, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> As long as the mapping doesn't change after the first onlining of the
> CPU, the workqueue side shouldn't be too difficult to fix up. I'll
> look into it. For memory allocations, as long as the cpu <-> node
> mapping is established before any memory allocation for the cpu takes
> place, it should be fine too, I think.

Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
a whole bunch of per-cpu memory users that do things like:

	for_each_possible_cpu(cpu) {
		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);

		/* muck with foo */
	}

Which requires a cpu->node map for all possible cpus at boot time.
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Tejun Heo @ 2016-08-16 15:42 UTC
To: Peter Zijlstra
Cc: Heiko Carstens, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

Hello, Peter.

On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > As long as the mapping doesn't change after the first onlining of the
> > CPU, the workqueue side shouldn't be too difficult to fix up. I'll
> > look into it. For memory allocations, as long as the cpu <-> node
> > mapping is established before any memory allocation for the cpu takes
> > place, it should be fine too, I think.
>
> Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
> a whole bunch of per-cpu memory users that do things like:
>
> 	for_each_possible_cpu(cpu) {
> 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
>
> 		/* muck with foo */
> 	}
>
> Which requires a cpu->node map for all possible cpus at boot time.

Ah, right. If the cpu -> node mapping is dynamic, there isn't much we
can do about allocating per-cpu memory on the wrong node. And it is
problematic that percpu allocations can race against an onlining CPU
switching its node association.

One way to keep the mapping stable would be reserving per-node possible
CPU slots, so that the CPU number assigned to a new CPU is on the right
node. It'd be a simple solution but would get really expensive with an
increasing number of nodes.

Heiko, do you have any ideas?

Thanks.

--
tejun
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Heiko Carstens @ 2016-08-16 22:19 UTC
To: Tejun Heo
Cc: Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> Hello, Peter.
>
> On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > As long as the mapping doesn't change after the first onlining of the
> > > CPU, the workqueue side shouldn't be too difficult to fix up. I'll
> > > look into it. For memory allocations, as long as the cpu <-> node
> > > mapping is established before any memory allocation for the cpu takes
> > > place, it should be fine too, I think.
> >
> > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
> > a whole bunch of per-cpu memory users that do things like:
> >
> > 	for_each_possible_cpu(cpu) {
> > 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> >
> > 		/* muck with foo */
> > 	}
> >
> > Which requires a cpu->node map for all possible cpus at boot time.
>
> Ah, right. If the cpu -> node mapping is dynamic, there isn't much we
> can do about allocating per-cpu memory on the wrong node. And it is
> problematic that percpu allocations can race against an onlining CPU
> switching its node association.
>
> One way to keep the mapping stable would be reserving per-node possible
> CPU slots, so that the CPU number assigned to a new CPU is on the right
> node. It'd be a simple solution but would get really expensive with an
> increasing number of nodes.
>
> Heiko, do you have any ideas?

I think the easiest solution would be to simply assign all cpus for which
we do not have any topology information to an arbitrary node, e.g. round
robin.

After all, the case that cpus are added later is rare, and the s390 fake
NUMA implementation does not know about the memory topology anyway. All it
is doing is distributing the memory to several nodes in order to avoid a
single huge node. So that should be sort of ok.

Unless somebody has a better idea?

Michael, Martin?
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Michael Holzheu @ 2016-08-17  9:20 UTC
To: Heiko Carstens
Cc: Tejun Heo, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Martin Schwidefsky

On Wed, 17 Aug 2016 00:19:53 +0200, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:

> On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > > As long as the mapping doesn't change after the first onlining
> > > > of the CPU, the workqueue side shouldn't be too difficult to
> > > > fix up. I'll look into it. For memory allocations, as long as
> > > > the cpu <-> node mapping is established before any memory
> > > > allocation for the cpu takes place, it should be fine too, I
> > > > think.
> > >
> > > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot?
> > > There's a whole bunch of per-cpu memory users that do things
> > > like:
> > >
> > > 	for_each_possible_cpu(cpu) {
> > > 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> > >
> > > 		/* muck with foo */
> > > 	}
> > >
> > > Which requires a cpu->node map for all possible cpus at boot time.
> >
> > Ah, right. If the cpu -> node mapping is dynamic, there isn't much
> > we can do about allocating per-cpu memory on the wrong node. And it
> > is problematic that percpu allocations can race against an onlining
> > CPU switching its node association.
> >
> > One way to keep the mapping stable would be reserving per-node
> > possible CPU slots, so that the CPU number assigned to a new CPU is
> > on the right node. It'd be a simple solution but would get really
> > expensive with an increasing number of nodes.
> >
> > Heiko, do you have any ideas?
>
> I think the easiest solution would be to simply assign all cpus for
> which we do not have any topology information to an arbitrary node,
> e.g. round robin.
>
> After all, the case that cpus are added later is rare, and the s390
> fake NUMA implementation does not know about the memory topology
> anyway. All it is doing is distributing the memory to several nodes in
> order to avoid a single huge node. So that should be sort of ok.
>
> Unless somebody has a better idea?
>
> Michael, Martin?

If it is really required that cpu_to_node() can be called for all
possible cpus, this sounds like a reasonable workaround to me.

Michael
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Tejun Heo @ 2016-08-17 13:58 UTC
To: Heiko Carstens
Cc: Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky

Hello, Heiko.

On Wed, Aug 17, 2016 at 12:19:53AM +0200, Heiko Carstens wrote:
> I think the easiest solution would be to simply assign all cpus for which
> we do not have any topology information to an arbitrary node, e.g. round
> robin.
>
> After all, the case that cpus are added later is rare, and the s390 fake
> NUMA implementation does not know about the memory topology anyway. All it
> is doing is

Ah, okay, so there really is no requirement for a newly coming up cpu
to be on a specific node.

> distributing the memory to several nodes in order to avoid a single huge
> node. So that should be sort of ok.

Sounds good to me. If that's the only purpose, we don't lose much by
round-robining the possible CPUs on boot and sticking with the mapping.

Thanks.

--
tejun
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Michael Holzheu @ 2016-08-18  9:30 UTC
To: Tejun Heo
Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Martin Schwidefsky

On Wed, 17 Aug 2016 09:58:55 -0400, Tejun Heo <tj@kernel.org> wrote:

> Hello, Heiko.
>
> On Wed, Aug 17, 2016 at 12:19:53AM +0200, Heiko Carstens wrote:
> > I think the easiest solution would be to simply assign all cpus
> > for which we do not have any topology information to an arbitrary
> > node, e.g. round robin.
> >
> > After all, the case that cpus are added later is rare, and the s390
> > fake NUMA implementation does not know about the memory topology
> > anyway. All it is doing is
>
> Ah, okay, so there really is no requirement for a newly coming up cpu
> to be on a specific node.

Well, "no requirement" is not 100% correct. Currently we use the CPU
topology information to assign newly coming CPUs to the "best fitting"
node.

Example:

1) We have two fake NUMA nodes N1 and N2 with the following CPU
   assignment:

   - N1: cpu 1 on chip 1
   - N2: cpu 2 on chip 2

2) A new cpu 3 is configured that lives on chip 2.

3) We assign cpu 3 to N2.

We do this only if the nodes are balanced. If N2 already had one more
cpu than N1, we would assign the new cpu to N1.

Michael
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Tejun Heo @ 2016-08-18 14:42 UTC
To: Michael Holzheu
Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Martin Schwidefsky

Hello, Michael.

On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> Well, "no requirement" is not 100% correct. Currently we use the
> CPU topology information to assign newly coming CPUs to the "best
> fitting" node.
>
> Example:
>
> 1) We have two fake NUMA nodes N1 and N2 with the following CPU
>    assignment:
>
>    - N1: cpu 1 on chip 1
>    - N2: cpu 2 on chip 2
>
> 2) A new cpu 3 is configured that lives on chip 2.
> 3) We assign cpu 3 to N2.
>
> We do this only if the nodes are balanced. If N2 already had one more
> cpu than N1, we would assign the new cpu to N1.

I see. Out of curiosity, what's the purpose of fakenuma on s390?
There don't seem to be any actual memory locality concerns. Is it
just to segment the memory of a machine into multiple pieces? If so,
why is that necessary? Do you hit some scalability issues w/o NUMA
nodes?

As for the solution, if blind RR isn't good enough (although it sounds
like it could be, given that the balancing wasn't all that strong to
begin with), would it be an option to implement an interface which just
requests a new CPU rather than a specific one, and then pick one of the
vacant possible CPUs considering node balancing?

Thanks.

--
tejun
* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
From: Michael Holzheu @ 2016-08-19  9:52 UTC
To: Tejun Heo
Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML, Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Martin Schwidefsky

On Thu, 18 Aug 2016 10:42:08 -0400, Tejun Heo <tj@kernel.org> wrote:

> Hello, Michael.
>
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> >
> > Example:
> >
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> >
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> >
> > 2) A new cpu 3 is configured that lives on chip 2.
> > 3) We assign cpu 3 to N2.
> >
> > We do this only if the nodes are balanced. If N2 already had one
> > more cpu than N1, we would assign the new cpu to N1.
>
> I see. Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns. Is it
> just to segment the memory of a machine into multiple pieces?

Correct.

> If so, why is that necessary? Do you hit some scalability issues w/o
> NUMA nodes?

Yes, we hit a scalability issue. Our performance team found out that for
big (> 1 TB), overcommitted (memory / swap ratio > 1 : 2) systems we see
problems:

- Zone locks are highly contended because ZONE_NORMAL is big:
  * zone->lock
  * zone->lru_lock
- One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA, because for each
node a separate memory subsystem is created with separate zone locks and
kswapd threads.

> As for the solution, if blind RR isn't good enough (although it sounds
> like it could be, given that the balancing wasn't all that strong to
> begin with), would it be an option to implement an interface which just
> requests a new CPU rather than a specific one, and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To say it in my own words:

- At boot time we already pin all remaining "not configured" logical
  CPUs to nodes. So all possible cpus are pinned to nodes and
  cpu_to_node() will work.
- If a new physical cpu gets configured, we get the CPU topology
  information from the system and find the best node.
- We get a logical cpu number from the node pool and assign the new
  physical cpu to that number.

If that works, we would be as good as before. We will have a look into
the code to see if it is possible.

Michael