linux-kernel.vger.kernel.org archive mirror
* [PATCH] sched: fix infinity loop in update_blocked_averages
@ 2018-12-27  3:04 Xie XiuQi
  2018-12-27  9:21 ` Vincent Guittot
  2018-12-30 13:00 ` [tip:sched/urgent] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c tip-bot for Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Xie XiuQi @ 2018-12-27  3:04 UTC (permalink / raw)
  To: mingo, vincent.guittot, peterz, xiezhipeng1; +Cc: huawei.libin, linux-kernel

Zhipeng Xie reported a bug: there is an infinite loop in
update_blocked_averages().

PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
 #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
 #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
 #2 [ffff00002213baf0] __schedule at ffff000008deaa88
 #3 [ffff00002213bb70] schedule at ffff000008deb1b8
 #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
 #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
 #6 [ffff00002213bd00] do_futex at ffff000008183ee4
 #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
 #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
 #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
 #10 [ffff00002213bff0] el0_svc at ffff000008084044

rq->tmp_alone_branch, introduced in 4.10, points to the beginning
of the branch that is currently being added to rq->leaf_cfs_rq_list.
If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
dangling and causes a list_add corruption.
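
For reference, the insertion side that ends up dereferencing the stale
pointer looks roughly like this (a simplified paraphrase of the 4.19-era
list_add_leaf_cfs_rq(), not the verbatim source):

static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	int cpu = cpu_of(rq);

	if (cfs_rq->on_list)
		return;

	if (cfs_rq->tg->parent &&
	    cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
		/* parent already linked: insert just before it, branch done */
		list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
			&(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
	} else {
		/*
		 * Parent not linked yet (root cfs_rq case omitted here):
		 * insert at tmp_alone_branch and leave tmp_alone_branch
		 * pointing at the new entry so the parent can be chained in
		 * right behind it.  If that entry is removed before
		 * tmp_alone_branch is reset, the next insertion is made
		 * relative to a deleted node and __list_add_valid() fires.
		 */
		list_add_rcu(&cfs_rq->leaf_cfs_rq_list, rq->tmp_alone_branch);
		rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
	}

	cfs_rq->on_list = 1;
}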

(With DEBUG_LIST enabled, we found this list_add corruption.)

[ 2546.741103] list_add corruption. next->prev should be prev
(ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
[ 2546.741130] ------------[ cut here ]------------
[ 2546.741132] kernel BUG at lib/list_debug.c:25!
[ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
[ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
[ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
[ 2546.749015] pc : __list_add_valid+0x50/0x90
[ 2546.750485] lr : __list_add_valid+0x50/0x90
[ 2546.751975] sp : ffff00001b5eb910
[ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
[ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
[ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
[ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
[ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
[ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
[ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
[ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
[ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
[ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
[ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
[ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
[ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
[ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
[ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
[ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
[ 2546.780742] Call trace:
[ 2546.781955]  __list_add_valid+0x50/0x90
[ 2546.783469]  enqueue_entity+0x4a0/0x6e8
[ 2546.784957]  enqueue_task_fair+0xac/0x610
[ 2546.786502]  sched_move_task+0x134/0x178
[ 2546.787993]  cpu_cgroup_attach+0x40/0x78
[ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
[ 2546.791169]  cgroup_migrate+0x6c/0x90
[ 2546.792663]  cgroup_attach_task+0x148/0x238
[ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
[ 2546.795935]  cgroup1_procs_write+0x38/0x48
[ 2546.797492]  cgroup_file_write+0xa0/0x170
[ 2546.799010]  kernfs_fop_write+0x114/0x1e0
[ 2546.800558]  __vfs_write+0x60/0x190
[ 2546.801977]  vfs_write+0xac/0x1c0
[ 2546.803341]  ksys_write+0x6c/0xd8
[ 2546.804674]  __arm64_sys_write+0x24/0x30
[ 2546.806146]  el0_svc_common+0x78/0x100
[ 2546.807584]  el0_svc_handler+0x38/0x88
[ 2546.809017]  el0_svc+0x8/0xc

In this patch, we make rq->tmp_alone_branch point to the previous entry
before deleting the cfs_rq from the list.

Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Cc: Bin Li <huawei.libin@huawei.com>
Cc: <stable@vger.kernel.org>        [4.10+]
Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2..7a72702 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (cfs_rq->on_list) {
+		struct rq *rq = rq_of(cfs_rq);
+
+		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
+			rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
+
 		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
 		cfs_rq->on_list = 0;
 	}
-- 
1.8.3.1



* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27  3:04 [PATCH] sched: fix infinity loop in update_blocked_averages Xie XiuQi
@ 2018-12-27  9:21 ` Vincent Guittot
  2018-12-27 10:21   ` Vincent Guittot
  2018-12-30 12:04   ` Ingo Molnar
  2018-12-30 13:00 ` [tip:sched/urgent] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c tip-bot for Linus Torvalds
  1 sibling, 2 replies; 27+ messages in thread
From: Vincent Guittot @ 2018-12-27  9:21 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Ingo Molnar, Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel

Hi Xie,

On Thu, 27 Dec 2018 at 03:57, Xie XiuQi <xiexiuqi@huawei.com> wrote:
>
> Zhipeng Xie reported a bug: there is an infinite loop in
> update_blocked_averages().
>
> PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
>  #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
>  #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
>  #2 [ffff00002213baf0] __schedule at ffff000008deaa88
>  #3 [ffff00002213bb70] schedule at ffff000008deb1b8
>  #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
>  #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
>  #6 [ffff00002213bd00] do_futex at ffff000008183ee4
>  #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
>  #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
>  #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
>  #10 [ffff00002213bff0] el0_svc at ffff000008084044
>
> rq->tmp_alone_branch, introduced in 4.10, points to the beginning
> of the branch that is currently being added to rq->leaf_cfs_rq_list.
> If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
> dangling and causes a list_add corruption.

shouldn't all the sequence  be protected by rq_lock ?


>
> (With DEBUG_LIST enabled, we found this list_add corruption.)
>
> [ 2546.741103] list_add corruption. next->prev should be prev
> (ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
> [ 2546.741130] ------------[ cut here ]------------
> [ 2546.741132] kernel BUG at lib/list_debug.c:25!
> [ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
> [ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
> [ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
> [ 2546.749015] pc : __list_add_valid+0x50/0x90
> [ 2546.750485] lr : __list_add_valid+0x50/0x90
> [ 2546.751975] sp : ffff00001b5eb910
> [ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
> [ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
> [ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
> [ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
> [ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
> [ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
> [ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
> [ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
> [ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
> [ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
> [ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
> [ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
> [ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
> [ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
> [ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
> [ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
> [ 2546.780742] Call trace:
> [ 2546.781955]  __list_add_valid+0x50/0x90
> [ 2546.783469]  enqueue_entity+0x4a0/0x6e8
> [ 2546.784957]  enqueue_task_fair+0xac/0x610
> [ 2546.786502]  sched_move_task+0x134/0x178
> [ 2546.787993]  cpu_cgroup_attach+0x40/0x78
> [ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
> [ 2546.791169]  cgroup_migrate+0x6c/0x90
> [ 2546.792663]  cgroup_attach_task+0x148/0x238
> [ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
> [ 2546.795935]  cgroup1_procs_write+0x38/0x48
> [ 2546.797492]  cgroup_file_write+0xa0/0x170
> [ 2546.799010]  kernfs_fop_write+0x114/0x1e0
> [ 2546.800558]  __vfs_write+0x60/0x190
> [ 2546.801977]  vfs_write+0xac/0x1c0
> [ 2546.803341]  ksys_write+0x6c/0xd8
> [ 2546.804674]  __arm64_sys_write+0x24/0x30
> [ 2546.806146]  el0_svc_common+0x78/0x100
> [ 2546.807584]  el0_svc_handler+0x38/0x88
> [ 2546.809017]  el0_svc+0x8/0xc
>

Have you got more details about the sequence that generates this bug ?
Is it easily reproducible ?

> In this patch, we make rq->tmp_alone_branch point to the previous entry
> before deleting the cfs_rq from the list.
>
> Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> Cc: Bin Li <huawei.libin@huawei.com>
> Cc: <stable@vger.kernel.org>        [4.10+]
> Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)

If it only happens in update_blocked_averages(), the del leaf has been added by:
a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)

> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac855b2..7a72702 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>         if (cfs_rq->on_list) {
> +               struct rq *rq = rq_of(cfs_rq);
> +
> +               if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> +                       rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> +
>                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
>                 cfs_rq->on_list = 0;
>         }
> --
> 1.8.3.1
>


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27  9:21 ` Vincent Guittot
@ 2018-12-27 10:21   ` Vincent Guittot
  2018-12-27 10:23     ` Vincent Guittot
  2018-12-30 12:04   ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2018-12-27 10:21 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Ingo Molnar, Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel

On Thursday 27 Dec 2018 at 10:21:53 (+0100), Vincent Guittot wrote:
> Hi Xie,
> 
> On Thu, 27 Dec 2018 at 03:57, Xie XiuQi <xiexiuqi@huawei.com> wrote:
> >
> > Zhipeng Xie reported a bug: there is an infinite loop in
> > update_blocked_averages().
> >
> > PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
> >  #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
> >  #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
> >  #2 [ffff00002213baf0] __schedule at ffff000008deaa88
> >  #3 [ffff00002213bb70] schedule at ffff000008deb1b8
> >  #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
> >  #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
> >  #6 [ffff00002213bd00] do_futex at ffff000008183ee4
> >  #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
> >  #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
> >  #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
> >  #10 [ffff00002213bff0] el0_svc at ffff000008084044
> >
> > rq->tmp_alone_branch, introduced in 4.10, points to the beginning
> > of the branch that is currently being added to rq->leaf_cfs_rq_list.
> > If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
> > dangling and causes a list_add corruption.
> 
> shouldn't all the sequence  be protected by rq_lock ?
> 
> 
> >
> > (With DEBUG_LIST enabled, we found this list_add corruption.)
> >
> > [ 2546.741103] list_add corruption. next->prev should be prev
> > (ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
> > [ 2546.741130] ------------[ cut here ]------------
> > [ 2546.741132] kernel BUG at lib/list_debug.c:25!
> > [ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
> > [ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
> > [ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> > [ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
> > [ 2546.749015] pc : __list_add_valid+0x50/0x90
> > [ 2546.750485] lr : __list_add_valid+0x50/0x90
> > [ 2546.751975] sp : ffff00001b5eb910
> > [ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
> > [ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
> > [ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
> > [ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
> > [ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
> > [ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
> > [ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
> > [ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
> > [ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
> > [ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
> > [ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
> > [ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
> > [ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
> > [ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
> > [ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
> > [ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
> > [ 2546.780742] Call trace:
> > [ 2546.781955]  __list_add_valid+0x50/0x90
> > [ 2546.783469]  enqueue_entity+0x4a0/0x6e8
> > [ 2546.784957]  enqueue_task_fair+0xac/0x610
> > [ 2546.786502]  sched_move_task+0x134/0x178
> > [ 2546.787993]  cpu_cgroup_attach+0x40/0x78
> > [ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
> > [ 2546.791169]  cgroup_migrate+0x6c/0x90
> > [ 2546.792663]  cgroup_attach_task+0x148/0x238
> > [ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
> > [ 2546.795935]  cgroup1_procs_write+0x38/0x48
> > [ 2546.797492]  cgroup_file_write+0xa0/0x170
> > [ 2546.799010]  kernfs_fop_write+0x114/0x1e0
> > [ 2546.800558]  __vfs_write+0x60/0x190
> > [ 2546.801977]  vfs_write+0xac/0x1c0
> > [ 2546.803341]  ksys_write+0x6c/0xd8
> > [ 2546.804674]  __arm64_sys_write+0x24/0x30
> > [ 2546.806146]  el0_svc_common+0x78/0x100
> > [ 2546.807584]  el0_svc_handler+0x38/0x88
> > [ 2546.809017]  el0_svc+0x8/0xc
> >
> 
> Have you got more details about the sequence that generates this bug ?
> Is it easily reproducible ?
> 
> > In this patch, we make rq->tmp_alone_branch point to the previous entry
> > before deleting the cfs_rq from the list.
> >
> > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > Cc: Bin Li <huawei.libin@huawei.com>
> > Cc: <stable@vger.kernel.org>        [4.10+]
> > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> 
> If it only happens in update_blocked_averages(), the del leaf has been added by:
> a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> 
> > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > ---
> >  kernel/sched/fair.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ac855b2..7a72702 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >         if (cfs_rq->on_list) {
> > +               struct rq *rq = rq_of(cfs_rq);
> > +
> > +               if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > +                       rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > +

I'm afraid that your patch will break the ordering of leaf_cfs_rq_list

Can you tried the patch below:

---
 kernel/sched/fair.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca46964..4d51b2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7694,13 +7694,6 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
-- 
2.7.4




> >                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> >                 cfs_rq->on_list = 0;
> >         }
> > --
> > 1.8.3.1
> >


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 10:21   ` Vincent Guittot
@ 2018-12-27 10:23     ` Vincent Guittot
  2018-12-27 16:39       ` Sargun Dhillon
  0 siblings, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2018-12-27 10:23 UTC (permalink / raw)
  To: Xie XiuQi
  Cc: Ingo Molnar, Peter Zijlstra, xiezhipeng1, huawei.libin,
	linux-kernel, Sargun Dhillon, dmitry.adamushko, Tejun Heo

Adding Sargun and Dimitry who faced similar problem
Adding Tejun

On Thu, 27 Dec 2018 at 11:21, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Thursday 27 Dec 2018 at 10:21:53 (+0100), Vincent Guittot wrote:
> > Hi Xie,
> >
> > On Thu, 27 Dec 2018 at 03:57, Xie XiuQi <xiexiuqi@huawei.com> wrote:
> > >
> > > Zhipeng Xie reported a bug: there is an infinite loop in
> > > update_blocked_averages().
> > >
> > > PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
> > >  #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
> > >  #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
> > >  #2 [ffff00002213baf0] __schedule at ffff000008deaa88
> > >  #3 [ffff00002213bb70] schedule at ffff000008deb1b8
> > >  #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
> > >  #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
> > >  #6 [ffff00002213bd00] do_futex at ffff000008183ee4
> > >  #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
> > >  #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
> > >  #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
> > >  #10 [ffff00002213bff0] el0_svc at ffff000008084044
> > >
> > > rq->tmp_alone_branch, introduced in 4.10, points to the beginning
> > > of the branch that is currently being added to rq->leaf_cfs_rq_list.
> > > If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
> > > dangling and causes a list_add corruption.
> >
> > shouldn't all the sequence  be protected by rq_lock ?
> >
> >
> > >
> > > (With DEBUG_LIST enabled, we found this list_add corruption.)
> > >
> > > [ 2546.741103] list_add corruption. next->prev should be prev
> > > (ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
> > > [ 2546.741130] ------------[ cut here ]------------
> > > [ 2546.741132] kernel BUG at lib/list_debug.c:25!
> > > [ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
> > > [ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
> > > [ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> > > [ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
> > > [ 2546.749015] pc : __list_add_valid+0x50/0x90
> > > [ 2546.750485] lr : __list_add_valid+0x50/0x90
> > > [ 2546.751975] sp : ffff00001b5eb910
> > > [ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
> > > [ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
> > > [ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
> > > [ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
> > > [ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
> > > [ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
> > > [ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
> > > [ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
> > > [ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
> > > [ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
> > > [ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
> > > [ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
> > > [ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
> > > [ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
> > > [ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
> > > [ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
> > > [ 2546.780742] Call trace:
> > > [ 2546.781955]  __list_add_valid+0x50/0x90
> > > [ 2546.783469]  enqueue_entity+0x4a0/0x6e8
> > > [ 2546.784957]  enqueue_task_fair+0xac/0x610
> > > [ 2546.786502]  sched_move_task+0x134/0x178
> > > [ 2546.787993]  cpu_cgroup_attach+0x40/0x78
> > > [ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
> > > [ 2546.791169]  cgroup_migrate+0x6c/0x90
> > > [ 2546.792663]  cgroup_attach_task+0x148/0x238
> > > [ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
> > > [ 2546.795935]  cgroup1_procs_write+0x38/0x48
> > > [ 2546.797492]  cgroup_file_write+0xa0/0x170
> > > [ 2546.799010]  kernfs_fop_write+0x114/0x1e0
> > > [ 2546.800558]  __vfs_write+0x60/0x190
> > > [ 2546.801977]  vfs_write+0xac/0x1c0
> > > [ 2546.803341]  ksys_write+0x6c/0xd8
> > > [ 2546.804674]  __arm64_sys_write+0x24/0x30
> > > [ 2546.806146]  el0_svc_common+0x78/0x100
> > > [ 2546.807584]  el0_svc_handler+0x38/0x88
> > > [ 2546.809017]  el0_svc+0x8/0xc
> > >
> >
> > Have you got more details about the sequence that generates this bug ?
> > Is it easily reproducible ?
> >
> > > In this patch, we make rq->tmp_alone_branch point to the previous entry
> > > before deleting the cfs_rq from the list.
> > >
> > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > Cc: Bin Li <huawei.libin@huawei.com>
> > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> >
> > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> >
> > > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > > Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > ---
> > >  kernel/sched/fair.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ac855b2..7a72702 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > >  {
> > >         if (cfs_rq->on_list) {
> > > +               struct rq *rq = rq_of(cfs_rq);
> > > +
> > > +               if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > > +                       rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > > +
>
> I'm afraid that your patch will break the ordering of leaf_cfs_rq_list
>
> Can you tried the patch below:
>
> ---
>  kernel/sched/fair.c | 7 -------
>  1 file changed, 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ca46964..4d51b2d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7694,13 +7694,6 @@ static void update_blocked_averages(int cpu)
>                 if (se && !skip_blocked_update(se))
>                         update_load_avg(cfs_rq_of(se), se, 0);
>
> -               /*
> -                * There can be a lot of idle CPU cgroups.  Don't let fully
> -                * decayed cfs_rqs linger on the list.
> -                */
> -               if (cfs_rq_is_decayed(cfs_rq))
> -                       list_del_leaf_cfs_rq(cfs_rq);
> -
>                 /* Don't need periodic decay once load/util_avg are null */
>                 if (cfs_rq_has_blocked(cfs_rq))
>                         done = false;
> --
> 2.7.4
>
>
>
>
> > >                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > >                 cfs_rq->on_list = 0;
> > >         }
> > > --
> > > 1.8.3.1
> > >


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 10:23     ` Vincent Guittot
@ 2018-12-27 16:39       ` Sargun Dhillon
  2018-12-27 17:01         ` Vincent Guittot
  0 siblings, 1 reply; 27+ messages in thread
From: Sargun Dhillon @ 2018-12-27 16:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, dmitry.adamushko, Tejun Heo

On Thu, Dec 27, 2018 at 5:23 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Adding Sargun and Dimitry who faced similar problem
> Adding Tejun
>
> On Thu, 27 Dec 2018 at 11:21, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Thursday 27 Dec 2018 at 10:21:53 (+0100), Vincent Guittot wrote:
> > > Hi Xie,
> > >
> > > On Thu, 27 Dec 2018 at 03:57, Xie XiuQi <xiexiuqi@huawei.com> wrote:
> > > >
> > > > Zhipeng Xie reported a bug: there is an infinite loop in
> > > > update_blocked_averages().
> > > >
> > > > PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
> > > >  #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
> > > >  #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
> > > >  #2 [ffff00002213baf0] __schedule at ffff000008deaa88
> > > >  #3 [ffff00002213bb70] schedule at ffff000008deb1b8
> > > >  #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
> > > >  #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
> > > >  #6 [ffff00002213bd00] do_futex at ffff000008183ee4
> > > >  #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
> > > >  #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
> > > >  #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
> > > >  #10 [ffff00002213bff0] el0_svc at ffff000008084044
> > > >
> > > > rq->tmp_alone_branch, introduced in 4.10, points to the beginning
> > > > of the branch that is currently being added to rq->leaf_cfs_rq_list.
> > > > If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
> > > > dangling and causes a list_add corruption.
> > >
> > > shouldn't all the sequence  be protected by rq_lock ?
> > >
> > >
> > > >
> > > > (With DEBUG_LIST enabled, we found this list_add corruption.)
> > > >
> > > > [ 2546.741103] list_add corruption. next->prev should be prev
> > > > (ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
> > > > [ 2546.741130] ------------[ cut here ]------------
> > > > [ 2546.741132] kernel BUG at lib/list_debug.c:25!
> > > > [ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
> > > > [ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
> > > > [ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> > > > [ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
> > > > [ 2546.749015] pc : __list_add_valid+0x50/0x90
> > > > [ 2546.750485] lr : __list_add_valid+0x50/0x90
> > > > [ 2546.751975] sp : ffff00001b5eb910
> > > > [ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
> > > > [ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
> > > > [ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
> > > > [ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
> > > > [ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
> > > > [ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
> > > > [ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
> > > > [ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
> > > > [ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
> > > > [ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
> > > > [ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
> > > > [ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
> > > > [ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
> > > > [ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
> > > > [ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
> > > > [ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
> > > > [ 2546.780742] Call trace:
> > > > [ 2546.781955]  __list_add_valid+0x50/0x90
> > > > [ 2546.783469]  enqueue_entity+0x4a0/0x6e8
> > > > [ 2546.784957]  enqueue_task_fair+0xac/0x610
> > > > [ 2546.786502]  sched_move_task+0x134/0x178
> > > > [ 2546.787993]  cpu_cgroup_attach+0x40/0x78
> > > > [ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
> > > > [ 2546.791169]  cgroup_migrate+0x6c/0x90
> > > > [ 2546.792663]  cgroup_attach_task+0x148/0x238
> > > > [ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
> > > > [ 2546.795935]  cgroup1_procs_write+0x38/0x48
> > > > [ 2546.797492]  cgroup_file_write+0xa0/0x170
> > > > [ 2546.799010]  kernfs_fop_write+0x114/0x1e0
> > > > [ 2546.800558]  __vfs_write+0x60/0x190
> > > > [ 2546.801977]  vfs_write+0xac/0x1c0
> > > > [ 2546.803341]  ksys_write+0x6c/0xd8
> > > > [ 2546.804674]  __arm64_sys_write+0x24/0x30
> > > > [ 2546.806146]  el0_svc_common+0x78/0x100
> > > > [ 2546.807584]  el0_svc_handler+0x38/0x88
> > > > [ 2546.809017]  el0_svc+0x8/0xc
> > > >
> > >
> > > Have you got more details about the sequence that generates this bug ?
> > > Is it easily reproducible ?
> > >
> > > > In this patch, we make rq->tmp_alone_branch point to the previous entry
> > > > before deleting the cfs_rq from the list.
> > > >
> > > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > > Cc: Bin Li <huawei.libin@huawei.com>
> > > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> > >
> > > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> > >
> > > > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > > > Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > > ---
> > > >  kernel/sched/fair.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index ac855b2..7a72702 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > >  {
> > > >         if (cfs_rq->on_list) {
> > > > +               struct rq *rq = rq_of(cfs_rq);
> > > > +
> > > > +               if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > > > +                       rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > > > +
> >
> > I'm afraid that your patch will break the ordering of leaf_cfs_rq_list
> >
> > Can you tried the patch below:
> >
> > ---
> >  kernel/sched/fair.c | 7 -------
> >  1 file changed, 7 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ca46964..4d51b2d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7694,13 +7694,6 @@ static void update_blocked_averages(int cpu)
> >                 if (se && !skip_blocked_update(se))
> >                         update_load_avg(cfs_rq_of(se), se, 0);
> >
> > -               /*
> > -                * There can be a lot of idle CPU cgroups.  Don't let fully
> > -                * decayed cfs_rqs linger on the list.
> > -                */
> > -               if (cfs_rq_is_decayed(cfs_rq))
> > -                       list_del_leaf_cfs_rq(cfs_rq);
> > -
> >                 /* Don't need periodic decay once load/util_avg are null */
> >                 if (cfs_rq_has_blocked(cfs_rq))
> >                         done = false;
> > --
> > 2.7.4
> >
> >
> >
> >
Tested-by: Sargun Dhillon <sargun@sargun.me>

This patch fixes things for me. I imagine this code that's being
removed has a purpose?
> > > >                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > > >                 cfs_rq->on_list = 0;
> > > >         }
> > > > --
> > > > 1.8.3.1
> > > >


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 16:39       ` Sargun Dhillon
@ 2018-12-27 17:01         ` Vincent Guittot
  2018-12-27 18:15           ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2018-12-27 17:01 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, Dmitry Adamushko, Tejun Heo

On Thu, 27 Dec 2018 at 17:40, Sargun Dhillon <sargun@sargun.me> wrote:
>
> On Thu, Dec 27, 2018 at 5:23 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > Adding Sargun and Dimitry who faced similar problem
> > Adding Tejun
> >
> > On Thu, 27 Dec 2018 at 11:21, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Thursday 27 Dec 2018 at 10:21:53 (+0100), Vincent Guittot wrote:
> > > > Hi Xie,
> > > >
> > > > On Thu, 27 Dec 2018 at 03:57, Xie XiuQi <xiexiuqi@huawei.com> wrote:
> > > > >
> > > > > Zhipeng Xie reported a bug: there is an infinite loop in
> > > > > update_blocked_averages().
> > > > >
> > > > > PID: 14233  TASK: ffff800b2de08fc0  CPU: 1   COMMAND: "docker"
> > > > >  #0 [ffff00002213b9d0] update_blocked_averages at ffff00000811e4a8
> > > > >  #1 [ffff00002213ba60] pick_next_task_fair at ffff00000812a3b4
> > > > >  #2 [ffff00002213baf0] __schedule at ffff000008deaa88
> > > > >  #3 [ffff00002213bb70] schedule at ffff000008deb1b8
> > > > >  #4 [ffff00002213bb80] futex_wait_queue_me at ffff000008180754
> > > > >  #5 [ffff00002213bbd0] futex_wait at ffff00000818192c
> > > > >  #6 [ffff00002213bd00] do_futex at ffff000008183ee4
> > > > >  #7 [ffff00002213bde0] __arm64_sys_futex at ffff000008184398
> > > > >  #8 [ffff00002213be60] el0_svc_common at ffff0000080979ac
> > > > >  #9 [ffff00002213bea0] el0_svc_handler at ffff000008097a6c
> > > > >  #10 [ffff00002213bff0] el0_svc at ffff000008084044
> > > > >
> > > > > rq->tmp_alone_branch, introduced in 4.10, points to the beginning
> > > > > of the branch that is currently being added to rq->leaf_cfs_rq_list.
> > > > > If this cfs_rq is deleted somewhere else, tmp_alone_branch is left
> > > > > dangling and causes a list_add corruption.
> > > >
> > > > shouldn't all the sequence  be protected by rq_lock ?
> > > >
> > > >
> > > > >
> > > > > (With DEBUG_LIST enabled, we found this list_add corruption.)
> > > > >
> > > > > [ 2546.741103] list_add corruption. next->prev should be prev
> > > > > (ffff800b4d61ad40), but was ffff800ba434fa38. (next=ffff800b6a95e740).
> > > > > [ 2546.741130] ------------[ cut here ]------------
> > > > > [ 2546.741132] kernel BUG at lib/list_debug.c:25!
> > > > > [ 2546.741136] Internal error: Oops - BUG: 0 [#1] SMP
> > > > > [ 2546.742870] CPU: 1 PID: 29428 Comm: docker-runc Kdump: loaded Tainted: G E  4.19.5-1.aarch64 #1
> > > > > [ 2546.745415] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> > > > > [ 2546.747402] pstate: 40000085 (nZcv daIf -PAN -UAO)
> > > > > [ 2546.749015] pc : __list_add_valid+0x50/0x90
> > > > > [ 2546.750485] lr : __list_add_valid+0x50/0x90
> > > > > [ 2546.751975] sp : ffff00001b5eb910
> > > > > [ 2546.753286] x29: ffff00001b5eb910 x28: ffff800abacf0000
> > > > > [ 2546.754976] x27: ffff00001b5ebbb0 x26: ffff000009570000
> > > > > [ 2546.756665] x25: ffff00000960d000 x24: 00000250f41ca8f8
> > > > > [ 2546.758366] x23: ffff800b6a95e740 x22: ffff800b4d61ad40
> > > > > [ 2546.760066] x21: ffff800b4d61ad40 x20: ffff800ba434f080
> > > > > [ 2546.761742] x19: ffff800b4d61ac00 x18: ffffffffffffffff
> > > > > [ 2546.763425] x17: 0000000000000000 x16: 0000000000000000
> > > > > [ 2546.765089] x15: ffff000009570748 x14: 6666662073617720
> > > > > [ 2546.766755] x13: 747562202c293034 x12: 6461313664346230
> > > > > [ 2546.768429] x11: 3038666666662820 x10: 0000000000000000
> > > > > [ 2546.770124] x9 : 0000000000000001 x8 : ffff000009f34a0f
> > > > > [ 2546.771831] x7 : 0000000000000000 x6 : 000000000000250d
> > > > > [ 2546.773525] x5 : 0000000000000000 x4 : 0000000000000000
> > > > > [ 2546.775227] x3 : 0000000000000000 x2 : 70ef7f624013ca00
> > > > > [ 2546.776929] x1 : 0000000000000000 x0 : 0000000000000075
> > > > > [ 2546.778623] Process docker-runc (pid: 29428, stack limit = 0x00000000293494a2)
> > > > > [ 2546.780742] Call trace:
> > > > > [ 2546.781955]  __list_add_valid+0x50/0x90
> > > > > [ 2546.783469]  enqueue_entity+0x4a0/0x6e8
> > > > > [ 2546.784957]  enqueue_task_fair+0xac/0x610
> > > > > [ 2546.786502]  sched_move_task+0x134/0x178
> > > > > [ 2546.787993]  cpu_cgroup_attach+0x40/0x78
> > > > > [ 2546.789540]  cgroup_migrate_execute+0x378/0x3a8
> > > > > [ 2546.791169]  cgroup_migrate+0x6c/0x90
> > > > > [ 2546.792663]  cgroup_attach_task+0x148/0x238
> > > > > [ 2546.794211]  __cgroup1_procs_write.isra.2+0xf8/0x160
> > > > > [ 2546.795935]  cgroup1_procs_write+0x38/0x48
> > > > > [ 2546.797492]  cgroup_file_write+0xa0/0x170
> > > > > [ 2546.799010]  kernfs_fop_write+0x114/0x1e0
> > > > > [ 2546.800558]  __vfs_write+0x60/0x190
> > > > > [ 2546.801977]  vfs_write+0xac/0x1c0
> > > > > [ 2546.803341]  ksys_write+0x6c/0xd8
> > > > > [ 2546.804674]  __arm64_sys_write+0x24/0x30
> > > > > [ 2546.806146]  el0_svc_common+0x78/0x100
> > > > > [ 2546.807584]  el0_svc_handler+0x38/0x88
> > > > > [ 2546.809017]  el0_svc+0x8/0xc
> > > > >
> > > >
> > > > Have you got more details about the sequence that generates this bug ?
> > > > Is it easily reproducible ?
> > > >
> > > > > In this patch, we make rq->tmp_alone_branch point to the previous entry
> > > > > before deleting the cfs_rq from the list.
> > > > >
> > > > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > > > Cc: Bin Li <huawei.libin@huawei.com>
> > > > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> > > >
> > > > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > > > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> > > >
> > > > > Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
> > > > > Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > > > ---
> > > > >  kernel/sched/fair.c | 5 +++++
> > > > >  1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index ac855b2..7a72702 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > > >  {
> > > > >         if (cfs_rq->on_list) {
> > > > > +               struct rq *rq = rq_of(cfs_rq);
> > > > > +
> > > > > +               if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > > > > +                       rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
> > > > > +
> > >
> > > I'm afraid that your patch will break the ordering of leaf_cfs_rq_list
> > >
> > > Can you tried the patch below:
> > >
> > > ---
> > >  kernel/sched/fair.c | 7 -------
> > >  1 file changed, 7 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index ca46964..4d51b2d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7694,13 +7694,6 @@ static void update_blocked_averages(int cpu)
> > >                 if (se && !skip_blocked_update(se))
> > >                         update_load_avg(cfs_rq_of(se), se, 0);
> > >
> > > -               /*
> > > -                * There can be a lot of idle CPU cgroups.  Don't let fully
> > > -                * decayed cfs_rqs linger on the list.
> > > -                */
> > > -               if (cfs_rq_is_decayed(cfs_rq))
> > > -                       list_del_leaf_cfs_rq(cfs_rq);
> > > -
> > >                 /* Don't need periodic decay once load/util_avg are null */
> > >                 if (cfs_rq_has_blocked(cfs_rq))
> > >                         done = false;
> > > --
> > > 2.7.4
> > >
> > >
> > >
> > >
> Tested-by: Sargun Dhillon <sargun@sargun.me>

Thanks

>
> This patch fixes things for me. I imagine this code that's being
> removed has a purpose?

In the original behavior, the cfs_rq was removed from the list only
when the cgroup was removed.
Patch a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance
path) added an optimization which removes the cfs_rq when there is
no blocked load left to update, in order to shorten the loop, but it
introduced a race condition that creates this infinite loop. The
patch fixes the problem by removing the optimization.
I will look at re-adding the optimization once I have a fix for
the race condition.
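
(For reference, the optimization being discussed is the decay check in
update_blocked_averages(), roughly:

	/* from a9e7f6544b9c: drop fully decayed cfs_rqs from the leaf list */
	if (cfs_rq_is_decayed(cfs_rq))
		list_del_leaf_cfs_rq(cfs_rq);

i.e. exactly the hunk that the proposed patch removes.)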


> > > > >                 list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > > > >                 cfs_rq->on_list = 0;
> > > > >         }
> > > > > --
> > > > > 1.8.3.1
> > > > >


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 17:01         ` Vincent Guittot
@ 2018-12-27 18:15           ` Linus Torvalds
  2018-12-27 21:08             ` Sargun Dhillon
  2018-12-28  1:15             ` Tejun Heo
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2018-12-27 18:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Sargun Dhillon, Xie XiuQi, Ingo Molnar, Peter Zijlstra,
	xiezhipeng1, huawei.libin, linux-kernel, Dmitry Adamushko,
	Tejun Heo

On Thu, Dec 27, 2018 at 9:02 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> In the original behavior, the cfs_rq was removed from the list only
> when the cgroup was removed.
> Patch a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance
> path) added an optimization which removes the cfs_rq when there is
> no blocked load left to update, in order to shorten the loop, but it
> introduced a race condition that creates this infinite loop. The
> patch fixes the problem by removing the optimization.
> I will look at re-adding the optimization once I have a fix for
> the race condition.

Hmm. What's the race? We seem to take the rq lock for all the cases,
but maybe I'm missing something?

That commit a9e7f6544b9c is a year and a half old, why did this start
being reported now?

[ goes off and looks ]

Oh. unthrottle_cfs_rq -> enqueue_entity -> list_add_leaf_cfs_rq()
doesn't actually seem to hold the rq lock at all. It's just called
under a rcu read lock.

So it all seems to depend on that "on_list" flag for exclusion. Which
seems fundamentally racy, since it's not protected by a lock.

So yeah, the whole logic seems to depend on "on_list is sticky and
stays set until the whole task group is destroyed".

So commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance
path") would appear to be entirely wrong, because on_list isn't
actually protected by a lock, and that can confuse things.

But that still makes me go "how come is this only noticed 18 months
after the fact"?

So I'm probably still missing something.

Tejun? PeterZ? Tell me why I'm being dense.

                   Linus


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 18:15           ` Linus Torvalds
@ 2018-12-27 21:08             ` Sargun Dhillon
  2018-12-27 21:46               ` Linus Torvalds
  2018-12-28  1:15             ` Tejun Heo
  1 sibling, 1 reply; 27+ messages in thread
From: Sargun Dhillon @ 2018-12-27 21:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Xie XiuQi, Ingo Molnar, Peter Zijlstra,
	xiezhipeng1, huawei.libin, linux-kernel, Dmitry Adamushko,
	Tejun Heo

On Thu, Dec 27, 2018 at 1:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Dec 27, 2018 at 9:02 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > In the original behavior, the cfs_rq was removed from the list only
> > when the cgroup was removed.
> > Patch a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance
> > path) added an optimization which removes the cfs_rq when there is
> > no blocked load left to update, in order to shorten the loop, but it
> > introduced a race condition that creates this infinite loop. The
> > patch fixes the problem by removing the optimization.
> > I will look at re-adding the optimization once I have a fix for
> > the race condition.
>
> Hmm. What's the race? We seem to take the rq lock for all the cases,
> but maybe I'm missing something?
>
> That commit a9e7f6544b9c is a year and a half old, why did this start
> being reported now?
>
This appears to be broken since October on 4.18.5. We've only noticed
it recently with a workload which does ridiculously parallel compiles
in cgroups that are rapidly churned.

It's also an awkward bug to catch, because none of the lockup
detectors were catching it in our environment. The only reason we
caught it was that it was blocking other cores, and those other cores
were missing IPIs, resulting in catastrophic failure.

> [ goes off and looks ]
>
> Oh. unthrottle_cfs_rq -> enqueue_entity -> list_add_leaf_cfs_rq()
> doesn't actually seem to hold the rq lock at all. It's just called
> under a rcu read lock.
>
> So it all seems to depend on that "on_list" flag for exclusion. Which
> seems fundamentally racy, since it's not protected by a lock.
>
> So yeah, the whole logic seems to depend on "on_list is sticky and
> stays set until the whole task group is destroyed".
>
> So commit a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance
> path") would appear to be entirely wrong, because on_list isn't
> actually protected by a lock, and that can confuse things.
>
> But that still makes me go "how come is this only noticed 18 months
> after the fact"?
>
> So I'm probably still missing something.
>
> Tejun? PeterZ? Tell me why I'm being dense.
>
>                    Linus


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 21:08             ` Sargun Dhillon
@ 2018-12-27 21:46               ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2018-12-27 21:46 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: Vincent Guittot, Xie XiuQi, Ingo Molnar, Peter Zijlstra,
	xiezhipeng1, huawei.libin, linux-kernel, Dmitry Adamushko,
	Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 1165 bytes --]

On Thu, Dec 27, 2018 at 1:09 PM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This appears to be broken since October on 4.18.5. We've only noticed
> it recently with a workload which does ridiculously parallel compiles
> in cgroups that are rapidly churned.

Yeah, that's probably unusual enough that people will have missed it.

Because it really looks like the bug has been there since 4.13, unless
I'm mis-reading things. Other things have changed there since, so
maybe I am.

> It's also an awkward bug to catch, because none of the lockup
> detectors, were catching it in our environment. The only reason we
> caught it was that it was blocking other cores, and those other cores
> were missing IPIs, resulting in catastrophic failure.

My gut feel is that we just need to revert that commit. It doesn't
revert cleanly, but it doesn't look hard to do manually.

Something like the attached?

But we do need Tejun and PeterZ to take a look, since there might be
something subtle going on.

Everybody is probably still on well-deserved vacations, so it might be
a while. But testing the attached patch is probably a good idea
regardless.

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 2945 bytes --]

 kernel/sched/fair.c | 41 ++++++++---------------------------------
 1 file changed, 8 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..01f3cb89d188 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -353,9 +353,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 }
 
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7647,27 +7646,10 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load.weight)
-		return false;
-
-	if (cfs_rq->avg.load_sum)
-		return false;
-
-	if (cfs_rq->avg.util_sum)
-		return false;
-
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
-	return true;
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7679,7 +7661,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7694,13 +7676,6 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10570,10 +10545,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27 18:15           ` Linus Torvalds
  2018-12-27 21:08             ` Sargun Dhillon
@ 2018-12-28  1:15             ` Tejun Heo
  2018-12-28  1:36               ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2018-12-28  1:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

Happy holidays, everyone.

(cc'ing Rik, who has been looking at the scheduler code a lot lately)

On Thu, Dec 27, 2018 at 10:15:17AM -0800, Linus Torvalds wrote:
> [ goes off and looks ]
> 
> Oh. unthrottle_cfs_rq -> enqueue_entity -> list_add_leaf_cfs_rq()
> doesn't actually seem to hold the rq lock at all. It's just called
> under a rcu read lock.

I'm pretty sure enqueue_entity() *has* to be called with rq lock.
unthrottle_cfs_rq() is called from tg_set_cfs_bandwidth(),
distribute_cfs_runtime() and unthrottle_offline_cfs_rqs.  The first
two grabs the rq_lock just around the calls and the last one has a
lockdep assert on the rq_lock.  What am I missing?

> So it all seems to depend on that "on_list" flag for exclusion. Which
> seems fundamentally racy, since it's not protected by a lock.

The only place on_list is accessed without holding rq_lock is
unregister_fair_sched_group().  It's a minor optimization on a
relatively cold path (group destruction), so if it's racy there, I
think we can take out that optimization.  I'd be surprised if anyone
notices that.
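
For reference, that path looks roughly like this (trimmed paraphrase,
not the verbatim source):

void unregister_fair_sched_group(struct task_group *tg)
{
	unsigned long flags;
	struct rq *rq;
	int cpu;

	for_each_possible_cpu(cpu) {
		if (tg->se[cpu])
			remove_entity_load_avg(tg->se[cpu]);

		/* unlocked peek at on_list: skip CPUs this group was never linked on */
		if (!tg->cfs_rq[cpu]->on_list)
			continue;

		rq = cpu_rq(cpu);

		raw_spin_lock_irqsave(&rq->lock, flags);
		list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}
}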

That said, I don't think it's broken.  False positive on on_list is
fine and I can't see how a false negative would happen given that the
only event which can set it is the sched entity getting scheduled and
there's no way the removal path can't race against that transition.

> But that still makes me go "how come is this only noticed 18 months
> after the fact"?

Unless I'm totally confused, which is definitely possible, I don't
think there's a race condition and the only bug is the
tmp_alone_branch pointer getting dangled, which maybe doesn't happen
all that much?

Thanks.

-- 
tejun


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  1:15             ` Tejun Heo
@ 2018-12-28  1:36               ` Linus Torvalds
  2018-12-28  1:53                 ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2018-12-28  1:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vincent Guittot, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Thu, Dec 27, 2018 at 5:15 PM Tejun Heo <tj@kernel.org> wrote:
>
> I'm pretty sure enqueue_entity() *has* to be called with rq lock.
> unthrottle_cfs_rq() is called from tg_set_cfs_bandwidth(),
> distribute_cfs_runtime() and unthrottle_offline_cfs_rqs.  The first
> two grabs the rq_lock just around the calls and the last one has a
> lockdep assert on the rq_lock.  What am I missing?

No, I think you're right, and I just didn't follow things deep enough,
didn't see any rq locking in the loop in unthrottle_offline_cfs_rqs(),
and didn't realize that the rq is locked by the caller.

> > But that still makes me go "how come is this only noticed 18 months
> > after the fact"?
>
> Unless I'm totally confused, which is definitely possible, I don't
> think there's a race condition and the only bug is the
> tmp_alone_branch pointer getting dangled, which maybe doesn't happen
> all that much?

Ahh. That would explain the list corruption. The next
list_add_leaf_cfs_rq() could try to add to a removed entry.

How would you reset it? Do something like

       rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

for every removal, or make it conditional on it matching the removed entry?
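
Sketching the two alternatives inside list_del_leaf_cfs_rq(), just to
make the question concrete (not a tested fix):

static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->on_list) {
		struct rq *rq = rq_of(cfs_rq);

		/* (a) unconditionally restart any half-built branch */
		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

		/* (b) or only reset it when it points at the entry going away:
		 *
		 *	if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
		 *		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
		 */

		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
		cfs_rq->on_list = 0;
	}
}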

            Linus


* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  1:36               ` Linus Torvalds
@ 2018-12-28  1:53                 ` Tejun Heo
  2018-12-28  2:02                   ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2018-12-28  1:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

Hello,

On Thu, Dec 27, 2018 at 05:36:47PM -0800, Linus Torvalds wrote:
> > Unless I'm totally confused, which is definitely possible, I don't
> > think there's a race condition and the only bug is the
> > tmp_alone_branch pointer getting dangled, which maybe doesn't happen
> > all that much?
> 
> Ahh. That would explain the list corruption. The next
> list_add_leaf_cfs_rq() could try to add to a removed entry.
> 
> How would you reset it? Do something like
> 
>        rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> 
> for every removal, or make it conditional on it matching the removed entry?

Vincent knows that part way better than me but I think the safest way
would be doing the optimization removal iff tmp_alone_branch is
already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
something else only while a branch is being built and deferring
optimization removal by an avg update cycle isn't gonna make any
difference anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  1:53                 ` Tejun Heo
@ 2018-12-28  2:02                   ` Tejun Heo
  2018-12-28  2:30                     ` Xie XiuQi
                                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Tejun Heo @ 2018-12-28  2:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vincent Guittot, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
> Vincent knows that part way better than me but I think the safest way
> would be doing the optimization removal iff tmp_alone_branch is
> already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
> something else only while a branch is being built and deferring
> optimization removal by an avg update cycle isn't gonna make any
> difference anyway.

So, something like the following.  Xie, can you see whether the
following patch resolves the problem?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..88b9118b5191 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
 		 * There can be a lot of idle CPU cgroups.  Don't let fully
 		 * decayed cfs_rqs linger on the list.
 		 */
-		if (cfs_rq_is_decayed(cfs_rq))
+		if (cfs_rq_is_decayed(cfs_rq) &&
+		    rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
 			list_del_leaf_cfs_rq(cfs_rq);
 
 		/* Don't need periodic decay once load/util_avg are null */

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  2:02                   ` Tejun Heo
@ 2018-12-28  2:30                     ` Xie XiuQi
  2018-12-28  5:38                     ` Sargun Dhillon
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Xie XiuQi @ 2018-12-28  2:30 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Vincent Guittot, Sargun Dhillon, Ingo Molnar, Peter Zijlstra,
	xiezhipeng1, huawei.libin, linux-kernel, Dmitry Adamushko,
	Rik van Riel

Hi Tejun,

On 2018/12/28 10:02, Tejun Heo wrote:
> On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
>> Vincent knows that part way better than me but I think the safest way
>> would be doing the optimization removal iff tmp_alone_branch is
>> already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
>> something else only while a branch is being built and deferring
>> optimization removal by an avg update cycle isn't gonna make any
>> difference anyway.
> 
> So, something like the following.  Xie, can you see whether the
> following patch resolves the problem?

Zhipeng is preparing to test it, thanks.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1907506318a..88b9118b5191 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
>  		 * There can be a lot of idle CPU cgroups.  Don't let fully
>  		 * decayed cfs_rqs linger on the list.
>  		 */
> -		if (cfs_rq_is_decayed(cfs_rq))
> +		if (cfs_rq_is_decayed(cfs_rq) &&
> +		    rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>  			list_del_leaf_cfs_rq(cfs_rq);
>  
>  		/* Don't need periodic decay once load/util_avg are null */
> 
> .
> 

-- 
Thanks,
Xie XiuQi


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  2:02                   ` Tejun Heo
  2018-12-28  2:30                     ` Xie XiuQi
@ 2018-12-28  5:38                     ` Sargun Dhillon
  2018-12-28  9:30                     ` Vincent Guittot
  2018-12-28 10:25                     ` Xiezhipeng (EulerOS)
  3 siblings, 0 replies; 27+ messages in thread
From: Sargun Dhillon @ 2018-12-28  5:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Vincent Guittot, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Thu, Dec 27, 2018 at 9:02 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
> > Vincent knows that part way better than me but I think the safest way
> > would be doing the optimization removal iff tmp_alone_branch is
> > already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
> > something else only while a branch is being built and deferring
> > optimization removal by an avg update cycle isn't gonna make any
> > difference anyway.
>
> So, something like the following.  Xie, can you see whether the
> following patch resolves the problem?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1907506318a..88b9118b5191 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>                  * decayed cfs_rqs linger on the list.
>                  */
> -               if (cfs_rq_is_decayed(cfs_rq))
> +               if (cfs_rq_is_decayed(cfs_rq) &&
> +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>                         list_del_leaf_cfs_rq(cfs_rq);
>
>                 /* Don't need periodic decay once load/util_avg are null */
Tested-by: Sargun Dhillon <sargun@sargun.me>
We've deployed this patch to our test workload. We haven't seen a crash yet.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  2:02                   ` Tejun Heo
  2018-12-28  2:30                     ` Xie XiuQi
  2018-12-28  5:38                     ` Sargun Dhillon
@ 2018-12-28  9:30                     ` Vincent Guittot
  2018-12-28 14:26                       ` Sargun Dhillon
  2018-12-28 16:54                       ` Tejun Heo
  2018-12-28 10:25                     ` Xiezhipeng (EulerOS)
  3 siblings, 2 replies; 27+ messages in thread
From: Vincent Guittot @ 2018-12-28  9:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Fri, 28 Dec 2018 at 03:02, Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
> > Vincent knows that part way better than me but I think the safest way
> > would be doing the optimization removal iff tmp_alone_branch is
> > already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
> > something else only while a branch is being built and deferring
> > optimization removal by an avg update cycle isn't gonna make any
> > difference anyway.

But the lock should not be released during the build of a branch, and
tmp_alone_branch must always point to rq->leaf_cfs_rq_list at the end,
before the lock is released.
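
A minimal sketch of that invariant, assuming the SCHED_WARN_ON() helper
from kernel/sched/sched.h (the assert_list_leaf_cfs_rq() name here is
only for illustration), would be a check run after every enqueue that
built a branch, just before the rq lock is dropped:

        /* Sketch: the branch cursor must be back at the list head here. */
        static inline void assert_list_leaf_cfs_rq(struct rq *rq)
        {
                SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
        }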

I think that there is a bigger problem with commit a9e7f6544b9c and
cfs_rq throttling:
Let take the example of the following topology TG2 --> TG1 --> root
1-The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
one path because it has never been used and can't be throttled so
tmp_alone_branch will point to leaf_cfs_rq_list at the end.
2-Then TG1 is throttled
3-and we add TG3 as a new child of TG1.
4-The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

With commit a9e7f6544b9c, we can del a cfs_rq from
rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes null before
step 2 above, TG1 cfs_rq is removed from the list.
Then at step 4, TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list,
but tmp_alone_branch still points to TG3 cfs_rq because its throttled
parent can't be enqueued when the lock is released.
tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
points to another TG cfs_rq, the next TG cfs_rq that is added
will be linked outside rq->leaf_cfs_rq_list.

In addition, we can break the ordering of the cfs_rq in
rq->leaf_cfs_rq_list but this ordering is used to update  and
propagate the update from leaf down to root.
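
Roughly, the insertion logic that this scenario trips over works as in
the simplified sketch below (see list_add_leaf_cfs_rq() in
kernel/sched/fair.c for the real code; the structure and comments are
condensed here for illustration):

        static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
        {
                struct rq *rq = rq_of(cfs_rq);
                struct cfs_rq *parent = cfs_rq->tg->parent ?
                        cfs_rq->tg->parent->cfs_rq[cpu_of(rq)] : NULL;

                if (cfs_rq->on_list)
                        return;

                if (parent && parent->on_list) {
                        /* Parent already linked: insert just before it; the
                         * branch is now connected, so reset the cursor. */
                        list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                                          &parent->leaf_cfs_rq_list);
                        rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
                } else if (!parent) {
                        /* Root cfs_rq: goes at the tail, branch complete. */
                        list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
                                          &rq->leaf_cfs_rq_list);
                        rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
                } else {
                        /* Parent not linked yet: insert at the cursor and
                         * advance it, expecting the parent's enqueue to
                         * finish the branch.  If the parent is throttled and
                         * never enqueued, the cursor is left pointing into
                         * the middle of the list, as described above. */
                        list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
                                     rq->tmp_alone_branch);
                        rq->tmp_alone_branch = &cfs_rq->leaf_cfs_rq_list;
                }
                cfs_rq->on_list = 1;
        }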

>
> So, something like the following.  Xie, can you see whether the
> following patch resolves the problem?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1907506318a..88b9118b5191 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>                  * decayed cfs_rqs linger on the list.
>                  */
> -               if (cfs_rq_is_decayed(cfs_rq))
> +               if (cfs_rq_is_decayed(cfs_rq) &&
> +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>                         list_del_leaf_cfs_rq(cfs_rq);

This patch reduces the cases, but I don't think it's enough because it
doesn't cover the case of unregister_fair_sched_group().
And we can still break the ordering of the cfs_rq.
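
For reference, the other removal path is roughly the following (abridged
from the 4.19-era unregister_fair_sched_group(); the point is that it
unlinks the cfs_rq independently of update_blocked_averages(), so the
tmp_alone_branch check in the patch above doesn't cover it):

        void unregister_fair_sched_group(struct task_group *tg)
        {
                unsigned long flags;
                struct rq *rq;
                int cpu;

                for_each_possible_cpu(cpu) {
                        if (tg->se[cpu])
                                remove_entity_load_avg(tg->se[cpu]);

                        /* Only empty task groups can be destroyed. */
                        if (!tg->cfs_rq[cpu]->on_list)
                                continue;

                        rq = cpu_rq(cpu);

                        raw_spin_lock_irqsave(&rq->lock, flags);
                        list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
                        raw_spin_unlock_irqrestore(&rq->lock, flags);
                }
        }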

>
>                 /* Don't need periodic decay once load/util_avg are null */

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  2:02                   ` Tejun Heo
                                       ` (2 preceding siblings ...)
  2018-12-28  9:30                     ` Vincent Guittot
@ 2018-12-28 10:25                     ` Xiezhipeng (EulerOS)
  3 siblings, 0 replies; 27+ messages in thread
From: Xiezhipeng (EulerOS) @ 2018-12-28 10:25 UTC (permalink / raw)
  To: Tejun Heo, Linus Torvalds
  Cc: Vincent Guittot, Sargun Dhillon, Xiexiuqi, Ingo Molnar,
	Peter Zijlstra, Libin (Huawei),
	linux-kernel, Dmitry Adamushko, Rik van Riel

Hi Tejun,

On Fri, Dec 28, 2018 10:03 AM, Tejun Heo wrote:
> 
> On Thu, Dec 27, 2018 at 05:53:52PM -0800, Tejun Heo wrote:
> > Vincent knows that part way better than me but I think the safest way
> > would be doing the optimization removal iff tmp_alone_branch is
> > already pointing to leaf_cfs_rq_list.  IIUC, it's pointing to
> > something else only while a branch is being built and deferring
> > optimization removal by an avg update cycle isn't gonna make any
> > difference anyway.
> 
> So, something like the following.  Xie, can you see whether the
> following patch resolves the problem?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1907506318a..88b9118b5191 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
>  		 * There can be a lot of idle CPU cgroups.  Don't let fully
>  		 * decayed cfs_rqs linger on the list.
>  		 */
> -		if (cfs_rq_is_decayed(cfs_rq))
> +		if (cfs_rq_is_decayed(cfs_rq) &&
> +		    rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
>  			list_del_leaf_cfs_rq(cfs_rq);
> 
>  		/* Don't need periodic decay once load/util_avg are null */
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>

This patch fixes things for me; we haven't seen a crash yet.

--
Thanks,
Zhipeng Xie

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  9:30                     ` Vincent Guittot
@ 2018-12-28 14:26                       ` Sargun Dhillon
  2018-12-28 16:54                       ` Tejun Heo
  1 sibling, 0 replies; 27+ messages in thread
From: Sargun Dhillon @ 2018-12-28 14:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Tejun Heo, Linus Torvalds, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

>
> But the lock should not be released during the build of a branch, and
> tmp_alone_branch must always point to rq->leaf_cfs_rq_list at the end,
> before the lock is released.
>
> I think that there is a bigger problem with commit a9e7f6544b9c and
> cfs_rq throttling:
> Let take the example of the following topology TG2 --> TG1 --> root
> 1-The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
> cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
> one path because it has never been used and can't be throttled so
> tmp_alone_branch will point to leaf_cfs_rq_list at the end.
> 2-Then TG1 is throttled
> 3-and we add TG3 as a new child of TG1.
> 4-The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
> cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
>
> With commit a9e7f6544b9c, we can del a cfs_rq from
> rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes null before
> step 2 above, TG1 cfs_rq is removed from the list.
> Then at step 4, TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list,
> but tmp_alone_branch still points to TG3 cfs_rq because its throttled
> parent can't be enqueued when the lock is released.
> tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
>
> So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
> points to another TG cfs_rq, the next TG cfs_rq that is added
> will be linked outside rq->leaf_cfs_rq_list.
>
> In addition, we can break the ordering of the cfs_rq in
> rq->leaf_cfs_rq_list but this ordering is used to update  and
> propagate the update from leaf down to root.
>
> >
> > So, something like the following.  Xie, can you see whether the
> > following patch resolves the problem?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d1907506318a..88b9118b5191 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
> >                  * There can be a lot of idle CPU cgroups.  Don't let fully
> >                  * decayed cfs_rqs linger on the list.
> >                  */
> > -               if (cfs_rq_is_decayed(cfs_rq))
> > +               if (cfs_rq_is_decayed(cfs_rq) &&
> > +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> >                         list_del_leaf_cfs_rq(cfs_rq);
>
> This patch reduces the cases, but I don't think it's enough because it
> doesn't cover the case of unregister_fair_sched_group().
> And we can still break the ordering of the cfs_rq.
>
> >
> >                 /* Don't need periodic decay once load/util_avg are null */

It looks like you're right. With this patch, our cluster still showed
failures overnight, but at least DEBUG_LIST caught them.

[10226.131822] ------------[ cut here ]------------
[10226.133094] kernel BUG at lib/list_debug.c:25!
[10226.134328] invalid opcode: 0000 [#1] SMP PTI
[10226.135540] CPU: 43 PID: 2005156 Comm: java Kdump: loaded Not
tainted 4.19.12netflix-46014-g794c368-dirty #13
[10226.138125] Hardware name: Amazon EC2 r5.24xlarge/, BIOS 1.0 10/16/2017
[10226.139894] RIP: 0010:__list_add_valid+0x38/0x90
[10226.141174] Code: 75 19 48 8b 32 48 39 f0 75 2e 48 39 fa 74 4c 48
39 f8 74 47 b8 01 00 00 00 5d c3 48 89 c1 48 c7 c7 00 b8 cc b9 e8 4b
13 b0 ff <0f> 0b 48 c7 c7 b0 f2 02 ba e8 aa d8 06 00 48 89 d1 48 c7 c7
58 b8
[10226.146855] RSP: 0018:ffffa944a4e83b50 EFLAGS: 00010082
[10226.149231] RAX: 0000000000000075 RBX: ffff8e7d9a133a00 RCX: 0000000000000000
[10226.152077] RDX: 0000000000000000 RSI: ffff8e7db76d6858 RDI: ffff8e7db76d6858
[10226.154922] RBP: ffffa944a4e83b50 R08: 0000000000000000 R09: 0000000000000000
[10226.157786] R10: 0000000000000000 R11: ffff8e7d670a1f00 R12: ffff8e7db7723f40
[10226.160630] R13: ffff8e7673ea6180 R14: ffff8e7d9a133b80 R15: ffff8e79f6693580
[10226.163502] FS:  00007f534e3d3700(0000) GS:ffff8e7db76c0000(0000)
knlGS:0000000000000000
[10226.167523] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10226.170000] CR2: 00007f5cd803bb08 CR3: 0000005ba351e005 CR4: 00000000003606e0
[10226.172811] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10226.175656] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10226.178503] Call Trace:
[10226.180170]  enqueue_entity+0x217/0x7e0
[10226.182202]  ? trace_hardirqs_off+0x22/0xe0
[10226.184307]  enqueue_task_fair+0x99/0x7b0
[10226.186376]  activate_task+0x33/0xa0
[10226.188322]  ttwu_do_activate+0x4b/0xb0
[10226.190346]  try_to_wake_up+0x281/0x8b0
[10226.192364]  ? find_held_lock+0x2e/0xb0
[10226.194376]  wake_up_q+0x32/0x80
[10226.196253]  futex_wake+0x15a/0x190
[10226.198178]  do_futex+0x34c/0xec0
[10226.200074]  ? lockdep_hardirqs_on+0x146/0x1e0
[10226.202236]  ? _raw_spin_unlock_irq+0x2c/0x40
[10226.204359]  ? trace_hardirqs_on+0x3f/0xe0
[10226.206453]  ? _raw_spin_unlock_irq+0x2c/0x40
[10226.208598]  ? finish_task_switch+0x8c/0x2b0
[10226.215629]  ? finish_task_switch+0x5f/0x2b0
[10226.217793]  ? __schedule+0x3dc/0xbe0
[10226.219760]  __x64_sys_futex+0x8b/0x180
[10226.221825]  ? trace_hardirqs_off_caller+0x22/0xe0
[10226.224102]  ? lockdep_hardirqs_on+0x146/0x1e0
[10226.226280]  ? do_syscall_64+0x19/0x1e0
[10226.228281]  do_syscall_64+0x67/0x1e0
[10226.230254]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[10226.232522] RIP: 0033:0x7f536705ebb1
[10226.234458] Code: 0f 05 48 3d 00 f0 ff ff 0f 87 12 02 00 00 45 84
c9 74 2d 4a 8d 7c ab 28 45 31 d2 ba 01 00 00 00 4c 89 f6 b8 ca 00 00
00 0f 05 <48> 3d 00 f0 ff ff 76 0e 83 f8 ea 74 09 83 f8 f2 0f 85 ae 00
00 00
[10226.242010] RSP: 002b:00007f534e3d2b60 EFLAGS: 00000246 ORIG_RAX:
00000000000000ca
[10226.245872] RAX: ffffffffffffffda RBX: 00007f536008ec50 RCX: 00007f536705ebb1
[10226.248707] RDX: 0000000000000001 RSI: 0000000000000081 RDI: 00007f536008ec78
[10226.251521] RBP: 00007f536008ec78 R08: 00007f536008ec70 R09: 0000000000000001
[10226.254301] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f536008ec64
[10226.257108] R13: 0000000000000000 R14: 0000000000000081 R15: 00007f536008ec50
[10226.259937] Modules linked in: veth sch_fq_codel ipvlan ipt_REJECT
nf_reject_ipv4 xt_tcpudp xt_owner act_mirred cls_matchall sch_ingress
cls_bpf sch_htb ifb iptable_filter ip_tables xt_conntrack x_tables
nf_nat bridge stp llc nf_conntrack_netlink nfnetlink nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo overlay binfmt_misc
nfit xfs input_leds intel_rapl_perf parport_pc i2c_piix4 parport
serio_raw tcp_bbr sunrpc autofs4 btrfs raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx xor
raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64
crypto_simd cryptd glue_helper intel_agp intel_gtt ena agpgart

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28  9:30                     ` Vincent Guittot
  2018-12-28 14:26                       ` Sargun Dhillon
@ 2018-12-28 16:54                       ` Tejun Heo
  2018-12-28 17:25                         ` Vincent Guittot
  1 sibling, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2018-12-28 16:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

Hello,

On Fri, Dec 28, 2018 at 10:30:07AM +0100, Vincent Guittot wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d1907506318a..88b9118b5191 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
> >                  * There can be a lot of idle CPU cgroups.  Don't let fully
> >                  * decayed cfs_rqs linger on the list.
> >                  */
> > -               if (cfs_rq_is_decayed(cfs_rq))
> > +               if (cfs_rq_is_decayed(cfs_rq) &&
> > +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> >                         list_del_leaf_cfs_rq(cfs_rq);
> 
> This patch reduces the cases, but I don't think it's enough because it
> doesn't cover the case of unregister_fair_sched_group().
> And we can still break the ordering of the cfs_rq.

So, if unregister_fair_sched_group() can corrupt the list, the bug is
there regardless of a9e7f6544b9ce, right?

Is there a reason why we're building a dedicated list for avg
propagation?  AFAICS, it's just doing depth first walk, which can be
done without extra space as long as each node has the parent pointer,
which they do.  Is the dedicated list an optimization?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28 16:54                       ` Tejun Heo
@ 2018-12-28 17:25                         ` Vincent Guittot
  2018-12-28 17:46                           ` Tejun Heo
  0 siblings, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2018-12-28 17:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Fri, 28 Dec 2018 at 17:54, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Dec 28, 2018 at 10:30:07AM +0100, Vincent Guittot wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index d1907506318a..88b9118b5191 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7698,7 +7698,8 @@ static void update_blocked_averages(int cpu)
> > >                  * There can be a lot of idle CPU cgroups.  Don't let fully
> > >                  * decayed cfs_rqs linger on the list.
> > >                  */
> > > -               if (cfs_rq_is_decayed(cfs_rq))
> > > +               if (cfs_rq_is_decayed(cfs_rq) &&
> > > +                   rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> > >                         list_del_leaf_cfs_rq(cfs_rq);
> >
> > This patch reduces the cases, but I don't think it's enough because it
> > doesn't cover the case of unregister_fair_sched_group().
> > And we can still break the ordering of the cfs_rq.
>
> So, if unregister_fair_sched_group() can corrupt the list, the bug is
> there regardless of a9e7f6544b9ce, right?

I don't think so, because without a9e7f6544b9ce the insertion in the
list is done only once, and we can't call unregister_fair_sched_group()
while an enqueue is ongoing, so tmp_alone_branch always points to
rq->leaf_cfs_rq_list.
a9e7f6544b9ce makes it possible for tmp_alone_branch to end up not
pointing to rq->leaf_cfs_rq_list.

>
> Is there a reason why we're building a dedicated list for avg
> propagation?  AFAICS, it's just doing depth first walk, which can be

We have this per-rq list of cfs_rq's (one per sched group) to update the
load of each cfs_rq and, as a result, the load of the task group. This is
then used to compute the share of a task group between CPUs.
This list must be ordered to correctly propagate the updates from leaves to root.

> done without extra space as long as each node has the parent pointer,
> which they do.  Is the dedicated list an optimization?

It avoids having to parse and walk every task_group struct each time.
Instead, you just have to follow a linked list.
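
For comparison, the alternative being discussed would look roughly like
the sketch below, where update_one() is just a stand-in for the
per-cfs_rq update that update_blocked_averages() does today while
following the ordered per-rq leaf list:

        /* Sketch: walk every task group in the system and pick this CPU's
         * cfs_rq.  The global task_groups list and tg->cfs_rq[] exist in
         * the kernel; no per-rq leaf list would need to be maintained, but
         * nothing here orders the visit from leaf to root by itself. */
        struct task_group *tg;
        int cpu = cpu_of(rq);

        rcu_read_lock();
        list_for_each_entry_rcu(tg, &task_groups, list)
                update_one(tg->cfs_rq[cpu]);
        rcu_read_unlock();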

Regards,
Vincent
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28 17:25                         ` Vincent Guittot
@ 2018-12-28 17:46                           ` Tejun Heo
  2018-12-28 18:04                             ` Vincent Guittot
  0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2018-12-28 17:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Linus Torvalds, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Fri, Dec 28, 2018 at 06:25:37PM +0100, Vincent Guittot wrote:
> > done without extra space as long as each node has the parent pointer,
> > which they do.  Is the dedicated list an optimization?
> 
> > It avoids having to parse and walk every task_group struct each time.
> > Instead, you just have to follow a linked list.

Hmmm... I'm having a bit of a hard time imagining doing an actual
traversal being a meaningful optimization.  It may require more
branches but that shouldn't be expensive at all, especially compared
to walking all idle groups in the system each time which the code used
to do. Anyways, this is tangential.

Thanks for the explanation and happy new year!

-- 
tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-28 17:46                           ` Tejun Heo
@ 2018-12-28 18:04                             ` Vincent Guittot
  0 siblings, 0 replies; 27+ messages in thread
From: Vincent Guittot @ 2018-12-28 18:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Sargun Dhillon, Xie XiuQi, Ingo Molnar,
	Peter Zijlstra, xiezhipeng1, huawei.libin, linux-kernel,
	Dmitry Adamushko, Rik van Riel

On Fri, 28 Dec 2018 at 18:46, Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Dec 28, 2018 at 06:25:37PM +0100, Vincent Guittot wrote:
> > > done without extra space as long as each node has the parent pointer,
> > > which they do.  Is the dedicated list an optimization?
> >
> > It avoids having to parse and walk every task_group struct each time.
> > Instead, you just have to follow a linked list.
>
> Hmmm... I'm having a bit of a hard time imagining doing an actual
> traversal being a meaningful optimization.  It may require more
> branches but that shouldn't be expensive at all, especially compared
> to walking all idle groups in the system each time which the code used
> to do. Anyways, this is tangential.
>
> Thanks for the explanation and happy new year!

Happy new year to you too.

Thanks
Vincent

>
> --
> tejun

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-27  9:21 ` Vincent Guittot
  2018-12-27 10:21   ` Vincent Guittot
@ 2018-12-30 12:04   ` Ingo Molnar
  2018-12-30 12:31     ` [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Ingo Molnar
  2018-12-30 12:36     ` [PATCH] sched: fix infinity loop in update_blocked_averages Vincent Guittot
  1 sibling, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2018-12-30 12:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, Linus Torvalds, Tejun Heo,
	Peter Zijlstra


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > Cc: Bin Li <huawei.libin@huawei.com>
> > Cc: <stable@vger.kernel.org>        [4.10+]
> > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> 
> If it only happens in update_blocked_averages(), the del leaf has been added by:
> a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)

So I think until we are confident in the proposed fixes, how about 
applying Linus's patch that reverts a9e7f6544b9c and simplifies the list 
manipulation?

That way we can re-introduce the O(nr_cgroups) optimization without 
pressure.

I'll prepare a commit for sched/urgent that does this, please holler if 
any of you disagrees!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
  2018-12-30 12:04   ` Ingo Molnar
@ 2018-12-30 12:31     ` Ingo Molnar
  2018-12-30 12:36     ` [PATCH] sched: fix infinity loop in update_blocked_averages Vincent Guittot
  1 sibling, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2018-12-30 12:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, Linus Torvalds, Tejun Heo,
	Peter Zijlstra


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
> > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > Cc: Bin Li <huawei.libin@huawei.com>
> > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> > 
> > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> 
> So I think until we are confident in the proposed fixes, how about 
> applying Linus's patch that reverts a9e7f6544b9c and simplifies the list 
> manipulation?
> 
> That way we can re-introduce the O(nr_cgroups) optimization without 
> pressure.
> 
> I'll prepare a commit for sched/urgent that does this, please holler if 
> any of you disagrees!

I've applied the patch below to tip:sched/urgent and I'll push it out if 
all goes well in testing:

  1e2adc76e619: ("sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

I've preemptively added the Tested-by tags of the gents who found and 
analyzed this bug:

  Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
  Tested-by: Sargun Dhillon <sargun@sargun.me>

... on the assumption that you'll do the testing of Linus's fix to make 
sure it's all good!

[ Will probably update the commit with acks and any other feedback before
  sending it to Linus tomorrow-ish. We don't want to end 2018 with a
  known scheduler bug in the upstream tree! ;-) ]

Thanks,

	Ingo

===========================>
From 1e2adc76e61924cdfd9dc50c728044d0fbbade27 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 27 Dec 2018 13:46:17 -0800
Subject: [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.

Do a (manual) revert of:

  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
has a surprising property that was not considered in followup commits
such as:

  9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b9c and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b9c - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..6483834f1278 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }
 
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+/* Iterate through all leaf cfs_rq's on a runqueue: */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7647,27 +7646,10 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load.weight)
-		return false;
-
-	if (cfs_rq->avg.load_sum)
-		return false;
-
-	if (cfs_rq->avg.util_sum)
-		return false;
-
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
-	return true;
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7679,7 +7661,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7694,13 +7676,6 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10570,10 +10545,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-30 12:04   ` Ingo Molnar
  2018-12-30 12:31     ` [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Ingo Molnar
@ 2018-12-30 12:36     ` Vincent Guittot
  2018-12-30 12:54       ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2018-12-30 12:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, Linus Torvalds, Tejun Heo,
	Peter Zijlstra

On Sun, 30 Dec 2018 at 13:04, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > Cc: Bin Li <huawei.libin@huawei.com>
> > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> >
> > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
>
> So I think until we are confident in the proposed fixes, how about
> applying Linus's patch that reverts a9e7f6544b9c and simplifies the list
> manipulation?

looks good to me

Thanks
Vincent
>
> That way we can re-introduce the O(nr_cgroups) optimization without
> pressure.
>
> I'll prepare a commit for sched/urgent that does this, please holler if
> any of you disagrees!
>
> Thanks,
>
>         Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] sched: fix infinity loop in update_blocked_averages
  2018-12-30 12:36     ` [PATCH] sched: fix infinity loop in update_blocked_averages Vincent Guittot
@ 2018-12-30 12:54       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2018-12-30 12:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Xie XiuQi, Ingo Molnar, Peter Zijlstra, xiezhipeng1,
	huawei.libin, linux-kernel, Linus Torvalds, Tejun Heo,
	Peter Zijlstra


* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> On Sun, 30 Dec 2018 at 13:04, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > * Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >
> > > > Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
> > > > Cc: Bin Li <huawei.libin@huawei.com>
> > > > Cc: <stable@vger.kernel.org>        [4.10+]
> > > > Fixes: 9c2791f936ef (sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list)
> > >
> > > If it only happens in update_blocked_averages(), the del leaf has been added by:
> > > a9e7f6544b9c (sched/fair: Fix O(nr_cgroups) in load balance path)
> >
> > So I think until we are confident in the proposed fixes, how about
> > applying Linus's patch that reverts a9e7f6544b9c and simplifies the list
> > manipulation?
> 
> looks good to me

Excellent - I've added your Acked-by.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [tip:sched/urgent] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
  2018-12-27  3:04 [PATCH] sched: fix infinity loop in update_blocked_averages Xie XiuQi
  2018-12-27  9:21 ` Vincent Guittot
@ 2018-12-30 13:00 ` tip-bot for Linus Torvalds
  1 sibling, 0 replies; 27+ messages in thread
From: tip-bot for Linus Torvalds @ 2018-12-30 13:00 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, efault, xiexiuqi, linux-kernel, tj, peterz, tglx,
	xiezhipeng1, vincent.guittot, huawei.libin, sargun, mingo

Commit-ID:  c40f7d74c741a907cfaeb73a7697081881c497d0
Gitweb:     https://git.kernel.org/tip/c40f7d74c741a907cfaeb73a7697081881c497d0
Author:     Linus Torvalds <torvalds@linux-foundation.org>
AuthorDate: Thu, 27 Dec 2018 13:46:17 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 30 Dec 2018 13:54:31 +0100

sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c

Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.

Do a (manual) revert of:

  a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
has a surprising property that was not considered in followup commits
such as:

  9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b9c and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b9c - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1907506318a..6483834f1278 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -352,10 +352,9 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 	}
 }
 
-/* Iterate thr' all leaf cfs_rq's on a runqueue */
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)			\
-	list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,	\
-				 leaf_cfs_rq_list)
+/* Iterate through all leaf cfs_rq's on a runqueue: */
+#define for_each_leaf_cfs_rq(rq, cfs_rq) \
+	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
 
 /* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
@@ -447,8 +446,8 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 }
 
-#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)	\
-		for (cfs_rq = &rq->cfs, pos = NULL; cfs_rq; cfs_rq = pos)
+#define for_each_leaf_cfs_rq(rq, cfs_rq)	\
+		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
 static inline struct sched_entity *parent_entity(struct sched_entity *se)
 {
@@ -7647,27 +7646,10 @@ static inline bool others_have_blocked(struct rq *rq)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
-{
-	if (cfs_rq->load.weight)
-		return false;
-
-	if (cfs_rq->avg.load_sum)
-		return false;
-
-	if (cfs_rq->avg.util_sum)
-		return false;
-
-	if (cfs_rq->avg.runnable_load_sum)
-		return false;
-
-	return true;
-}
-
 static void update_blocked_averages(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
@@ -7679,7 +7661,7 @@ static void update_blocked_averages(int cpu)
 	 * Iterates the task_group tree in a bottom up fashion, see
 	 * list_add_leaf_cfs_rq() for details.
 	 */
-	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
 		struct sched_entity *se;
 
 		/* throttled entities do not contribute to load */
@@ -7694,13 +7676,6 @@ static void update_blocked_averages(int cpu)
 		if (se && !skip_blocked_update(se))
 			update_load_avg(cfs_rq_of(se), se, 0);
 
-		/*
-		 * There can be a lot of idle CPU cgroups.  Don't let fully
-		 * decayed cfs_rqs linger on the list.
-		 */
-		if (cfs_rq_is_decayed(cfs_rq))
-			list_del_leaf_cfs_rq(cfs_rq);
-
 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
@@ -10570,10 +10545,10 @@ const struct sched_class fair_sched_class = {
 #ifdef CONFIG_SCHED_DEBUG
 void print_cfs_stats(struct seq_file *m, int cpu)
 {
-	struct cfs_rq *cfs_rq, *pos;
+	struct cfs_rq *cfs_rq;
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq_safe(cpu_rq(cpu), cfs_rq, pos)
+	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 	rcu_read_unlock();
 }

^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2018-12-27  3:04 [PATCH] sched: fix infinity loop in update_blocked_averages Xie XiuQi
2018-12-27  9:21 ` Vincent Guittot
2018-12-27 10:21   ` Vincent Guittot
2018-12-27 10:23     ` Vincent Guittot
2018-12-27 16:39       ` Sargun Dhillon
2018-12-27 17:01         ` Vincent Guittot
2018-12-27 18:15           ` Linus Torvalds
2018-12-27 21:08             ` Sargun Dhillon
2018-12-27 21:46               ` Linus Torvalds
2018-12-28  1:15             ` Tejun Heo
2018-12-28  1:36               ` Linus Torvalds
2018-12-28  1:53                 ` Tejun Heo
2018-12-28  2:02                   ` Tejun Heo
2018-12-28  2:30                     ` Xie XiuQi
2018-12-28  5:38                     ` Sargun Dhillon
2018-12-28  9:30                     ` Vincent Guittot
2018-12-28 14:26                       ` Sargun Dhillon
2018-12-28 16:54                       ` Tejun Heo
2018-12-28 17:25                         ` Vincent Guittot
2018-12-28 17:46                           ` Tejun Heo
2018-12-28 18:04                             ` Vincent Guittot
2018-12-28 10:25                     ` Xiezhipeng (EulerOS)
2018-12-30 12:04   ` Ingo Molnar
2018-12-30 12:31     ` [PATCH] sched: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c Ingo Molnar
2018-12-30 12:36     ` [PATCH] sched: fix infinity loop in update_blocked_averages Vincent Guittot
2018-12-30 12:54       ` Ingo Molnar
2018-12-30 13:00 ` [tip:sched/urgent] sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c tip-bot for Linus Torvalds
