linux-kernel.vger.kernel.org archive mirror
* [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
@ 2016-07-27 12:54 Heiko Carstens
  2016-07-27 15:23 ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Heiko Carstens @ 2016-07-27 12:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Thomas Gleixner

Hi Peter,

I get the following warning on s390 when using fake NUMA beginning with
your patch
e9d867a67fd0 "sched: Allow per-cpu kernel threads to run on online && !active"

[    3.162909] WARNING: CPU: 0 PID: 1 at include/linux/cpumask.h:121 select_task_rq+0xe6/0x1a8
[    3.162911] Modules linked in:
[    3.162914] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.6.0-rc6-00001-ge9d867a67fd0-dirty #28
[    3.162917] task: 00000001dd270008 ti: 00000001eccb4000 task.ti: 00000001eccb4000
[    3.162918] Krnl PSW : 0404c00180000000 0000000000176c56 (select_task_rq+0xe6/0x1a8)
[    3.162923]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
               Krnl GPRS: 00000000009f3d00 00000001dd270008 0000000000000100 00000000000000f4
[    3.162927]            0000000000eaf4e2 0000000000000000 00000001eccb7bc0 0400000000000001
[    3.162929]            00000001ec660950 0000000000000000 00000001ec660520 00000001ec660008
[    3.162930]            0000000000000100 000000000099c1a0 0000000000176c30 00000001eccb7a70
[    3.162940] Krnl Code: 0000000000176c4a: a774002f            brc     7,176ca8
                          0000000000176c4e: 92014000            mvi     0(%r4),1
                         #0000000000176c52: a7f40001            brc     15,176c54
                         >0000000000176c56: a7f40029            brc     15,176ca8
                          0000000000176c5a: 95004000            cli     0(%r4),0
                          0000000000176c5e: a7740006            brc     7,176c6a
                          0000000000176c62: 92014000            mvi     0(%r4),1
                          0000000000176c66: a7f40001            brc     15,176c68
[    3.162958] Call Trace:
[    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
[    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
[    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
[    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
[    3.162973] ([<0000000000162550>] apply_wqattrs_prepare+0x200/0x2a0)
[    3.162975] ([<000000000016266a>] apply_workqueue_attrs_locked+0x7a/0xb0)
[    3.162977] ([<0000000000162af0>] apply_workqueue_attrs+0x50/0x78)
[    3.162979] ([<000000000016441c>] __alloc_workqueue_key+0x304/0x520)
[    3.162983] ([<0000000000ee3706>] default_bdi_init+0x3e/0x70)
[    3.162986] ([<0000000000100270>] do_one_initcall+0x140/0x1d8)
[    3.162990] ([<0000000000ec9da8>] kernel_init_freeable+0x220/0x2d8)
[    3.162993] ([<0000000000984a7a>] kernel_init+0x2a/0x150)
[    3.162996] ([<00000000009913fa>] kernel_thread_starter+0x6/0xc)
[    3.162998] ([<00000000009913f4>] kernel_thread_starter+0x0/0xc)
[    3.163000] 4 locks held by swapper/0/1:
[    3.163002]  #0:  (cpu_hotplug.lock){++++++}, at: [<000000000013ebe0>] get_online_cpus+0x48/0xb8
[    3.163010]  #1:  (wq_pool_mutex){+.+.+.}, at: [<0000000000162ae2>] apply_workqueue_attrs+0x42/0x78
[    3.163016]  #2:  (&pool->lock/1){......}, at: [<000000000015d44a>] create_worker+0x152/0x1c0
[    3.163022]  #3:  (&p->pi_lock){..-...}, at: [<0000000000177ac4>] try_to_wake_up+0x44/0x478
[    3.163028] Last Breaking-Event-Address:
[    3.163030]  [<0000000000176c52>] select_task_rq+0xe2/0x1a8

For some unknown reason select_task_rq() gets called with a task that has
nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
warning later on.

It only happens with more than one node, otherwise it seems to work fine.

Any idea what could be wrong here?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-07-27 12:54 [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning Heiko Carstens
@ 2016-07-27 15:23 ` Thomas Gleixner
  2016-07-30 11:25   ` Heiko Carstens
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2016-07-27 15:23 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Peter Zijlstra, LKML, Tejun Heo

On Wed, 27 Jul 2016, Heiko Carstens wrote:
> [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)

> For some unknown reason select_task_rq() gets called with a task that has
> nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> warning later on.
> 
> It only happens with more than one node, otherwise it seems to work fine.
> 
> Any idea what could be wrong here?

create_worker()
    tsk = kthread_create_on_node();
    kthread_bind_mask(tsk, pool->attrs->cpumask);
        do_set_cpus_allowed(tsk, mask);
            set_cpus_allowed_common(tsk, mask);
                cpumask_copy(&tsk->cpus_allowed, mask);
                tsk->nr_cpus_allowed = cpumask_weight(mask);
    wake_up_process(task);

So this looks like pool->attrs->cpumask is simply empty.....

Thanks,

	tglx


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-07-27 15:23 ` Thomas Gleixner
@ 2016-07-30 11:25   ` Heiko Carstens
  2016-08-08  7:45     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: Heiko Carstens @ 2016-07-30 11:25 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Peter Zijlstra, LKML, Tejun Heo

On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
> 
> > For some unknown reason select_task_rq() gets called with a task that has
> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> > warning later on.
> > 
> > It only happens with more than one node, otherwise it seems to work fine.
> > 
> > Any idea what could be wrong here?
> 
> create_worker()
>     tsk = kthread_create_on_node();
>     kthread_bind_mask(tsk, pool->attrs->cpumask);
>         do_set_cpus_allowed(tsk, mask);
>             set_cpus_allowed_common(tsk, mask);
>                 cpumask_copy(&tsk->cpus_allowed, mask);
>                 tsk->nr_cpus_allowed = cpumask_weight(mask);
>     wake_up_process(task);
> 
> So this looks like pool->attrs->cpumask is simply empty.....

Just had some time to look into this a bit more. Looks like we initialize
the cpu_to_node_masks (way) too late on s390 for fake numa. So Peter's
patch just revealed that problem.

I'll see if initializing the masks earlier will fix this, but I think it
will.


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-07-30 11:25   ` Heiko Carstens
@ 2016-08-08  7:45     ` Ming Lei
  2016-08-15 11:19       ` Heiko Carstens
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2016-08-08  7:45 UTC (permalink / raw)
  To: Heiko Carstens; +Cc: Thomas Gleixner, Peter Zijlstra, LKML, Tejun Heo

On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens
<heiko.carstens@de.ibm.com> wrote:
> On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
>> On Wed, 27 Jul 2016, Heiko Carstens wrote:
>> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
>> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
>> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
>> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
>>
>> > For some unknown reason select_task_rq() gets called with a task that has
>> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
>> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
>> > warning later on.
>> >
>> > It only happens with more than one node, otherwise it seems to work fine.
>> >
>> > Any idea what could be wrong here?
>>
>> create_worker()
>>     tsk = kthread_create_on_node();
>>     kthread_bind_mask(tsk, pool->attrs->cpumask);
>>         do_set_cpus_allowed(tsk, mask);
>>             set_cpus_allowed_common(tsk, mask);
>>                 cpumask_copy(&tsk->cpus_allowed, mask);
>>                 tsk->nr_cpus_allowed = cpumask_weight(mask);
>>     wake_up_process(task);
>>
>> So this looks like pool->attrs->cpumask is simply empty.....
>
> Just had some time to look into this a bit more. Looks like we initialize
> the cpu_to_node_masks (way) too late on s390 for fake numa. So Peter's
> patch just revealed that problem.
>
> I'll see if initializing the masks earlier will fix this, but I think it
> will.

Hello,

Is there any fix for this issue?  I can see the issue on arm64 running
a v4.7 kernel too.  And the oops can be avoided by reverting commit
e9d867a ("sched: Allow per-cpu kernel threads to run on online && !active").


Thanks,
Ming Lei


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-08  7:45     ` Ming Lei
@ 2016-08-15 11:19       ` Heiko Carstens
  2016-08-15 22:48         ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Heiko Carstens @ 2016-08-15 11:19 UTC (permalink / raw)
  To: Ming Lei, Tejun Heo
  Cc: Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu,
	Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

On Mon, Aug 08, 2016 at 03:45:05PM +0800, Ming Lei wrote:
> On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens
> <heiko.carstens@de.ibm.com> wrote:
> > On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> >> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> >> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> >> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> >> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> >> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
> >>
> >> > For some unknown reason select_task_rq() gets called with a task that has
> >> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> >> > within select_task_rq() will set cpu to nr_cpu_ids which in turn causes the
> >> > warning later on.
> >> >
> >> > It only happens with more than one node, otherwise it seems to work fine.
> >> >
> >> > Any idea what could be wrong here?
> >>
> >> create_worker()
> >>     tsk = kthread_create_on_node();
> >>     kthread_bind_mask(tsk, pool->attrs->cpumask);
> >>         do_set_cpus_allowed(tsk, mask);
> >>             set_cpus_allowed_common(tsk, mask);
> >>                 cpumask_copy(&tsk->cpus_allowed, mask);
> >>                 tsk->nr_cpus_allowed = cpumask_weight(mask);
> >>     wake_up_process(task);
> >>
> >> So this looks like pool->attrs->cpumask is simply empty.....
> >
> > Just had some time to look into this a bit more. Looks like we initialize
> > the cpu_to_node_masks (way) too late on s390 for fake numa. So Peter's
> > patch just revealed that problem.
> >
> > I'll see if initializing the masks earlier will fix this, but I think it
> > will.
> 
> Hello,
> 
> Is there any fix for this issue?  I can see the issue on arm64 running
> v4.7 kernel too.  And the oops can be avoided by reverting commit
> e9d867a(sched: Allow per-cpu kernel threads to run on online && !active).

I don't know about the arm64 issue. The s390 problem is a result of
initializing the cpu_to_node mapping too late.

However, the workqueue code seems to assume that we know the cpu_to_node
mapping for all _possible_ cpus very early and apparently it assumes that
this mapping is stable and doesn't change anymore.

This assumption however contradicts the purpose of 346404682434 ("numa, cpu
hotplug: change links of CPU and node when changing node number by onlining
CPU").

So something is wrong here...

On s390 with fake numa we wouldn't even know the mapping of all _possible_
cpus at boot time. When establishing the node mapping we try hard to map
our existing cpu topology into a sane node mapping. However we simply don't
know where non-present cpus are located topology-wise.  Even for present
cpus the answer is not always there since present cpus can be in either the
state "configured" (topology location known - cpu online possible) or
"deconfigured" (topology location unknown - cpu online not possible).

I can imagine several ways to fix this for s390, but before doing that I'm
wondering if the workqueue code is correct with

a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
   that early

and

b) that the cpu_to_node() mapping does never change

Tejun?


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-15 11:19       ` Heiko Carstens
@ 2016-08-15 22:48         ` Tejun Heo
  2016-08-16  7:55           ` Heiko Carstens
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-08-15 22:48 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

Hello, Heiko.

On Mon, Aug 15, 2016 at 01:19:08PM +0200, Heiko Carstens wrote:
> I can imagine several ways to fix this for s390, but before doing that I'm
> wondering if the workqueue code is correct with
> 
> a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
>    that early

This can be debatable and making it "first registration sticks" is
likely easy enough.

> and
> 
> b) that the cpu_to_node() mapping does never change

However, this part isn't just from workqueue.  It just hits in a more
obvious way.  For example, memory allocation has the same problem and
we would have to synchronize memory allocations against cpu <-> node
mapping changing.  It'd be silly to add the complexity and overhead of
making the mapping dynamic when there's nothing inherently
dynamic about it.  The surface area is pretty big here.

I have no idea how s390 fakenuma works.  Is it very different from
x86's?  IIRC, x86's fakenuma isn't all that dynamic.

Thanks.

-- 
tejun


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-15 22:48         ` Tejun Heo
@ 2016-08-16  7:55           ` Heiko Carstens
  2016-08-16 15:20             ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Heiko Carstens @ 2016-08-16  7:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

Hi Tejun,

> On Mon, Aug 15, 2016 at 01:19:08PM +0200, Heiko Carstens wrote:
> > I can imagine several ways to fix this for s390, but before doing that I'm
> > wondering if the workqueue code is correct with
> > 
> > a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
> >    that early
> 
> This can be debatable and making it "first registration sticks" is
> likely easy enough.
> 
> > and
> > 
> > b) that the cpu_to_node() mapping does never change
> 
> However, this part isn't just from workqueue.  It just hits in a more
> obvious way.  For example, memory allocation has the same problem and
> we would have to synchronize memory allocations against cpu <-> node
> mapping changing.  It'd be silly to add the complexity and overhead of
> making the mapping dynamic when that there's nothing inherently
> dynamic about it.  The surface area is pretty big here.
> 
> I have no idea how s390 fakenuma works.  Is that very difficult from
> x86's?  IIRC, x86's fakenuma isn't all that dynamic.

I'm not asking to make the cpu <-> node completely dynamic. We have already
code in place to keep the cpu <-> node mapping static, however currently
this happens too late, but can be fixed quite easily.

Unfortunately we do not always know to which node a cpu belongs when we
register it; currently all cpus will be registered under node 0, and this
will only be corrected when a cpu is brought online.

The problem we have are "standby" cpus on s390, for which we know they are
present but can't use them currently. The mechanism is the following:

We detect a standby cpu and register it via register_cpu(); since the node
isn't known yet for this cpu, the cpu_to_node() function will return 0,
therefore all standby cpus will be registered under node 0.

The new standby cpu will have a "configure" sysfs attribute. If somebody
writes "1" to it we signal the hypervisor that we want to use the cpu and
it allocates one. If this request succeeds we finally know where the cpu is
located topology-wise and can fix up everything (and can also make the cpu
to node mapping static).
Note: as long as a cpu isn't configured it cannot be brought online.

If the cpu now is finally brought online the change_cpu_under_node() code
within drivers/base/cpu.c fixes up the node symlinks so at least the sysfs
representation is also correct.

If later on the cpu is brought offline, deconfigured, etc. we do not change
the cpu_to_node mapping anymore.

So the question is how to define "first registration sticks". :)


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16  7:55           ` Heiko Carstens
@ 2016-08-16 15:20             ` Tejun Heo
  2016-08-16 15:29               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-08-16 15:20 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

Hello, Heiko.

On Tue, Aug 16, 2016 at 09:55:05AM +0200, Heiko Carstens wrote:
...
>
> The new standby cpu will have a "configure" sysfs attribute. If somebody
> writes "1" to it we signal the hypervisor that we want to use the cpu and
> it allocates one. If this request succeeds we finally know where the cpu is
> located topology wise and can fix up everything (and can also make the cpu
> to node mapping static).
> Note: as long as cpu isn't configured it cannot be brought online.

I see, so the binding is actually dynamic, at least for the first
bring-up.

> If the cpu now is finally brought online the change_cpu_under_node() code
> within drivers/base/cpu.c fixes up the node symlinks so at least the sysfs
> representation is also correct.
> 
> If later on the cpu is brought offline, deconfigured, etc. we do not change
> the cpu_to_node mapping anymore.
> 
> So the question is how to define "first registration sticks". :)

As long as the mapping doesn't change after the first onlining of the
CPU, the workqueue side shouldn't be too difficult to fix up.  I'll
look into it.  For memory allocations, as long as the cpu <-> node
mapping is established before any memory allocation for the cpu takes
place, it should be fine too, I think.

Anyways, give me some days.  I'll look into updating workqueue so that
it works with the mapping on the first onlining.

Thanks.

-- 
tejun


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16 15:20             ` Tejun Heo
@ 2016-08-16 15:29               ` Peter Zijlstra
  2016-08-16 15:42                 ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2016-08-16 15:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Heiko Carstens, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> As long as the mapping doesn't change after the first onlining of the
> CPU, the workqueue side shouldn't be too difficult to fix up.  I'll
> look into it.  For memory allocations, as long as the cpu <-> node
> mapping is established before any memory allocation for the cpu takes
> place, it should be fine too, I think.

Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
a whole bunch of per-cpu memory users that do things like:


	for_each_possible_cpu(cpu) {
		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);

		/* muck with foo */
	}


Which requires a cpu->node map for all possible cpus at boot time.


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16 15:29               ` Peter Zijlstra
@ 2016-08-16 15:42                 ` Tejun Heo
  2016-08-16 22:19                   ` Heiko Carstens
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-08-16 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

Hello, Peter.

On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > As long as the mapping doesn't change after the first onlining of the
> > CPU, the workqueue side shouldn't be too difficult to fix up.  I'll
> > look into it.  For memory allocations, as long as the cpu <-> node
> > mapping is established before any memory allocation for the cpu takes
> > place, it should be fine too, I think.
> 
> Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
> a whole bunch of per-cpu memory users that does things like:
> 
> 
> 	for_each_possible_cpu(cpu) {
> 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> 
> 		/* muck with foo */
> 	}
> 
> 
> Which requires a cpu->node map for all possible cpus at boot time.

Ah, right.  If cpu -> node mapping is dynamic, there isn't much that
we can do about allocating per-cpu memory on the wrong node.  And it
is problematic that percpu allocations can race against an onlining
CPU switching its node association.

One way to keep the mapping stable would be reserving per-node
possible CPU slots so that the CPU number assigned to a new CPU is on
the right node.  It'd be a simple solution but would get really
expensive with an increasing number of nodes.

Heiko, do you have any ideas?

Thanks.

-- 
tejun


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16 15:42                 ` Tejun Heo
@ 2016-08-16 22:19                   ` Heiko Carstens
  2016-08-17  9:20                     ` Michael Holzheu
  2016-08-17 13:58                     ` Tejun Heo
  0 siblings, 2 replies; 16+ messages in thread
From: Heiko Carstens @ 2016-08-16 22:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > As long as the mapping doesn't change after the first onlining of the
> > > CPU, the workqueue side shouldn't be too difficult to fix up.  I'll
> > > look into it.  For memory allocations, as long as the cpu <-> node
> > > mapping is established before any memory allocation for the cpu takes
> > > place, it should be fine too, I think.
> > 
> > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot? There's
> > a whole bunch of per-cpu memory users that does things like:
> > 
> > 
> > 	for_each_possible_cpu(cpu) {
> > 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> > 
> > 		/* muck with foo */
> > 	}
> > 
> > 
> > Which requires a cpu->node map for all possible cpus at boot time.
> 
> Ah, right.  If cpu -> node mapping is dynamic, there isn't much that
> we can do about allocating per-cpu memory on the wrong node.  And it
> is problematic that percpu allocations can race against an onlining
> CPU switching its node association.
> 
> One way to keep the mapping stable would be reserving per-node
> possible CPU slots so that the CPU number assigned to a new CPU is on
> the right node.  It'd be a simple solution but would get really
> expensive with increasing number of nodes.
> 
> Heiko, do you have any ideas?

I think the easiest solution would be to simply assign all cpus, for which
we do not have any topology information, to an arbitrary node; e.g. round
robin.

After all, the case where cpus are added later is rare, and the s390 fake
numa implementation does not know about the memory topology anyway. All it
does is distribute the memory across several nodes in order to avoid a
single huge node. So that should be sort of ok.

Unless somebody has a better idea?

Michael, Martin?


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16 22:19                   ` Heiko Carstens
@ 2016-08-17  9:20                     ` Michael Holzheu
  2016-08-17 13:58                     ` Tejun Heo
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Holzheu @ 2016-08-17  9:20 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Tejun Heo, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan,
	Martin Schwidefsky

Am Wed, 17 Aug 2016 00:19:53 +0200
schrieb Heiko Carstens <heiko.carstens@de.ibm.com>:

> On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > > As long as the mapping doesn't change after the first onlining
> > > > of the CPU, the workqueue side shouldn't be too difficult to
> > > > fix up.  I'll look into it.  For memory allocations, as long as
> > > > the cpu <-> node mapping is established before any memory
> > > > allocation for the cpu takes place, it should be fine too, I
> > > > think.
> > > 
> > > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot?
> > > There's a whole bunch of per-cpu memory users that does things
> > > like:
> > > 
> > > 
> > > 	for_each_possible_cpu(cpu) {
> > > 		struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> > > 
> > > 		/* muck with foo */
> > > 	}
> > > 
> > > 
> > > Which requires a cpu->node map for all possible cpus at boot time.
> > 
> > Ah, right.  If cpu -> node mapping is dynamic, there isn't much that
> > we can do about allocating per-cpu memory on the wrong node.  And it
> > is problematic that percpu allocations can race against an onlining
> > CPU switching its node association.
> > 
> > One way to keep the mapping stable would be reserving per-node
> > possible CPU slots so that the CPU number assigned to a new CPU is
> > on the right node.  It'd be a simple solution but would get really
> > expensive with increasing number of nodes.
> > 
> > Heiko, do you have any ideas?
> 
> I think the easiest solution would be to simply assign all cpus, for
> which we do not have any topology information, to an arbitrary node;
> e.g. round robin.
> 
> After all the case that cpus are added later is rare and the s390
> fake numa implementation does not know about the memory topology. All
> it is doing is distributing the memory to several nodes in order to
> avoid a single huge node. So that should be sort of ok.
> 
> Unless somebody has a better idea?
> 
> Michael, Martin?

If it is really required that cpu_to_node() can be called for all
possible cpus, this sounds like a reasonable workaround to me.

Michael


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-16 22:19                   ` Heiko Carstens
  2016-08-17  9:20                     ` Michael Holzheu
@ 2016-08-17 13:58                     ` Tejun Heo
  2016-08-18  9:30                       ` Michael Holzheu
  1 sibling, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-08-17 13:58 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Michael Holzheu,
	Martin Schwidefsky

Hello, Heiko.

On Wed, Aug 17, 2016 at 12:19:53AM +0200, Heiko Carstens wrote:
> I think the easiest solution would be to simply assign all cpus, for which
> we do not have any topology information, to an arbitrary node; e.g. round
> robin.
> 
> After all the case that cpus are added later is rare and the s390 fake numa
> implementation does not know about the memory topology. All it is doing is

Ah, okay, so there really is no requirement for a newly coming up cpu
to be on a specific node.

> distributing the memory to several nodes in order to avoid a single huge
> node. So that should be sort of ok.

Sounds good to me.  If that's the only purpose, we don't lose much by
round-robining the possible CPUs on boot and sticking with the
mapping.

Thanks.

-- 
tejun


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-17 13:58                     ` Tejun Heo
@ 2016-08-18  9:30                       ` Michael Holzheu
  2016-08-18 14:42                         ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Michael Holzheu @ 2016-08-18  9:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan,
	Martin Schwidefsky

Am Wed, 17 Aug 2016 09:58:55 -0400
schrieb Tejun Heo <tj@kernel.org>:

> Hello, Heiko.
> 
> On Wed, Aug 17, 2016 at 12:19:53AM +0200, Heiko Carstens wrote:
> > I think the easiest solution would be to simply assign all cpus,
> > for which we do not have any topology information, to an arbitrary
> > node; e.g. round robin.
> > 
> > After all the case that cpus are added later is rare and the s390
> > fake numa implementation does not know about the memory topology.
> > All it is doing is
> 
> Ah, okay, so there really is no requirement for a newly coming up cpu
> to be on a specific node.

Well, "no requirement" is not 100% correct. Currently we use the
CPU topology information to assign newly arriving CPUs to the "best
fitting" node.

Example:

1) We have two fake NUMA nodes N1 and N2 with the following CPU
   assignment:

   - N1: cpu 1 on chip 1
   - N2: cpu 2 on chip 2

2) A new cpu 3 is configured that lives on chip 2
3) We assign cpu 3 to N2

We do this only if the nodes are balanced. If N2 already had one more
cpu than N1, we would assign the new cpu to N1.

Michael


* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-18  9:30                       ` Michael Holzheu
@ 2016-08-18 14:42                         ` Tejun Heo
  2016-08-19  9:52                           ` Michael Holzheu
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2016-08-18 14:42 UTC (permalink / raw)
  To: Michael Holzheu
  Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan,
	Martin Schwidefsky

Hello, Michael.

On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> Well, "no requirement" is not 100% correct. Currently we use the
> CPU topology information to assign newly configured CPUs to the "best
> fitting" node.
> 
> Example:
> 
> 1) We have two fake NUMA nodes N1 and N2 with the following CPU
>    assignment:
> 
>    - N1: cpu 1 on chip 1
>    - N2: cpu 2 on chip 2
> 
> 2) A new cpu 3 is configured that lives on chip 2
> 3) We assign cpu 3 to N2
> 
> We do this only if the nodes are balanced. If N2 had already one more
> cpu than N1 we would assign the new cpu to N1.

I see.  Out of curiosity, what's the purpose of fakenuma on s390?
There don't seem to be any actual memory locality concerns.  Is it
just to segment memory of a machine into multiple pieces?  If so, why
is that necessary, do you hit some scalability issues w/o NUMA nodes?

As for the solution, if blind RR isn't good enough, although it sounds
like it could given that the balancing wasn't all that strong to begin
with, would it be an option to implement an interface which just
requests a new CPU rather than a specific one and then pick one of the
vacant possible CPUs considering node balancing?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
  2016-08-18 14:42                         ` Tejun Heo
@ 2016-08-19  9:52                           ` Michael Holzheu
  0 siblings, 0 replies; 16+ messages in thread
From: Michael Holzheu @ 2016-08-19  9:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
	Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan,
	Martin Schwidefsky

Am Thu, 18 Aug 2016 10:42:08 -0400
schrieb Tejun Heo <tj@kernel.org>:

> Hello, Michael.
> 
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly configured CPUs to
> > the "best fitting" node.
> > 
> > Example:
> > 
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> > 
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> > 
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> > 
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
> 
> I see.  Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns.  Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found out that for
big (> 1 TB), overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To say it in my own words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.

 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.

 - We get a logical cpu number from the node pool and assign the
   new physical cpu to that number.

If that works, we would be as good as before. We will have a look at
the code to see if it is possible.
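
Sketched as code, the last step could look like this (the struct and function names are hypothetical, chosen only to mirror the steps above):

```c
#include <assert.h>

#define POOL_MAX 8

/*
 * Hypothetical sketch: every node owns a pool of logical CPU numbers
 * that were pinned to it at boot but are not configured yet.  When a
 * physical CPU shows up, the best-fitting node is determined from its
 * topology, and the next free logical number is drawn from that
 * node's pool.
 */
struct node_pool {
	int free_cpu[POOL_MAX];	/* pre-pinned logical CPU numbers */
	int nr_free;
};

/* returns a logical CPU number, or -1 if the node's pool is empty */
int claim_logical_cpu(struct node_pool *pool)
{
	if (pool->nr_free == 0)
		return -1;
	return pool->free_cpu[--pool->nr_free];
}
```

Because every logical number handed out this way was already pinned to the node at boot, cpu_to_node() stays valid for the whole lifetime of the system.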

Michael

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-08-19  9:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-27 12:54 [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning Heiko Carstens
2016-07-27 15:23 ` Thomas Gleixner
2016-07-30 11:25   ` Heiko Carstens
2016-08-08  7:45     ` Ming Lei
2016-08-15 11:19       ` Heiko Carstens
2016-08-15 22:48         ` Tejun Heo
2016-08-16  7:55           ` Heiko Carstens
2016-08-16 15:20             ` Tejun Heo
2016-08-16 15:29               ` Peter Zijlstra
2016-08-16 15:42                 ` Tejun Heo
2016-08-16 22:19                   ` Heiko Carstens
2016-08-17  9:20                     ` Michael Holzheu
2016-08-17 13:58                     ` Tejun Heo
2016-08-18  9:30                       ` Michael Holzheu
2016-08-18 14:42                         ` Tejun Heo
2016-08-19  9:52                           ` Michael Holzheu
