* [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
@ 2018-04-19 17:29 Kirill Tkhai
  2018-04-20  9:25 ` Juri Lelli
  2018-04-25 17:55 ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-19 17:29 UTC (permalink / raw)
  To: mingo, peterz, linux-kernel, ktkhai

tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take
a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().
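
Roughly, the contention looks like this (a simplified sketch of the call
chains, not the exact code):

	tg_set_rt_bandwidth()               copy_process()  [fork() path]
	  read_lock(&tasklist_lock)           ...
	  walk_tg_tree(tg_rt_schedulable)     write_lock_irq(&tasklist_lock)
	    tg_has_rt_tasks()                   /* spins, interrupts off */
	      for_each_process_thread(g, p)
	        /* 20000+ iterations, repeated for every child group */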

PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
 #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
 #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
 #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
 #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
    [exception RIP: tg_rt_schedulable+463]
    RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
    RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
    RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
    R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
    R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
 #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
 #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
 #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
 #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
 #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d

The patch reworks tg_has_rt_tasks() and makes it check
rt_rq::rt_nr_running instead of iterating over the task list.
This makes the function scale well: its execution time no longer
depends on the number of processes in the system.

Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce new races compared
to what we had before. Also, rt_rq::rt_nr_running counts queued
child cfs_rq in addition to queued tasks. Since tg_has_rt_tasks()
is only used in the !runtime case:

	if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
		return -EBUSY;

the behaviour won't change. The only difference is that walk_tg_tree()
calling tg_rt_schedulable() will break its iteration on a parent cfs_rq,
i.e. earlier.
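
For reference, the tree walk behaves roughly as below (a simplified
recursive sketch assuming the children/siblings lists of struct
task_group; the real walk_tg_tree_from() is iterative and also takes an
"up" visitor):

	static int walk_tg_tree_sketch(struct task_group *tg,
				       int (*down)(struct task_group *, void *),
				       void *data)
	{
		struct task_group *child;
		int ret;

		/* Visit the current group first... */
		ret = down(tg, data);		/* e.g. tg_rt_schedulable() */
		if (ret)
			return ret;		/* nonzero stops the whole walk */

		/* ...then descend into the children. */
		list_for_each_entry(child, &tg->children, siblings) {
			ret = walk_tg_tree_sketch(child, down, data);
			if (ret)
				break;
		}
		return ret;
	}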

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/sched/rt.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..601151bb9322 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-/* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
-	struct task_struct *g, *p;
+	struct rt_rq *rt_rq;
+	int cpu, ret = 0;
 
 	/*
 	 * Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
 	if (task_group_is_autogroup(tg))
 		return 0;
 
-	for_each_process_thread(g, p) {
-		if (rt_task(p) && task_group(p) == tg)
-			return 1;
+	preempt_disable();
+
+	for_each_online_cpu(cpu) {
+		rt_rq = tg->rt_rq[cpu];
+		if (READ_ONCE(rt_rq->rt_nr_running)) {
+			ret = 1;
+			break;
+		}
 	}
 
-	return 0;
+	preempt_enable();
+	return ret;
 }
 
 struct rt_schedulable_data {
@@ -2510,7 +2516,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return -EINVAL;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
 		goto unlock;
@@ -2528,7 +2533,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	}
 	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
 unlock:
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return err;
@@ -2582,9 +2586,7 @@ static int sched_rt_global_constraints(void)
 	int ret = 0;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return ret;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-19 17:29 [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks() Kirill Tkhai
@ 2018-04-20  9:25 ` Juri Lelli
  2018-04-20  9:43   ` Kirill Tkhai
  2018-04-25 17:55 ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2018-04-20  9:25 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: mingo, peterz, linux-kernel

Hi Kirill,

On 19/04/18 20:29, Kirill Tkhai wrote:
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
> 
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
> 
> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>     [exception RIP: tg_rt_schedulable+463]
>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> 
> The patch reworks tg_has_rt_tasks() and makes it to check
> for rt_rq::rt_nr_running instead of iteration over task list.
> This makes the function to scale well, and its execution time
> does not depend on number of processes in the system.
> 
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before. Also, rt_rq::rt_nr_running contains queued
> child cfs_rq in additional to queued task. Since tg_has_rt_tasks()

s/cfs_/rt_/ , right?

> is used in case of !runtime case:
> 
> 	if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
> 		return -EBUSY;
> 
> the behaviour won't change. The only change is that walk_tg_tree()
> calling tg_rt_schedulable() will break its iteration on parent cfs_rq,

Ditto.

> i.e. earlier.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  kernel/sched/rt.c |   22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..601151bb9322 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
>   */
>  static DEFINE_MUTEX(rt_constraints_mutex);
>  
> -/* Must be called with tasklist_lock held */
>  static inline int tg_has_rt_tasks(struct task_group *tg)
>  {
> -	struct task_struct *g, *p;
> +	struct rt_rq *rt_rq;
> +	int cpu, ret = 0;
>  
>  	/*
>  	 * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>  	if (task_group_is_autogroup(tg))
>  		return 0;
>  
> -	for_each_process_thread(g, p) {
> -		if (rt_task(p) && task_group(p) == tg)
> -			return 1;
> +	preempt_disable();
> +
> +	for_each_online_cpu(cpu) {
> +		rt_rq = tg->rt_rq[cpu];
> +		if (READ_ONCE(rt_rq->rt_nr_running)) {

Isn't this however checking against the current (dynamic) number of
runnable tasks/groups instead of the "static" group membership (which
shouldn't be affected by a task running state)?

Best,

- Juri

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20  9:25 ` Juri Lelli
@ 2018-04-20  9:43   ` Kirill Tkhai
  2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
  2018-04-20 10:58     ` [PATCH] sched/rt: Rework " Juri Lelli
  0 siblings, 2 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-20  9:43 UTC (permalink / raw)
  To: Juri Lelli; +Cc: mingo, peterz, linux-kernel

Hi, Juri,

On 20.04.2018 12:25, Juri Lelli wrote:
> Hi Kirill,
> 
> On 19/04/18 20:29, Kirill Tkhai wrote:
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>>     [exception RIP: tg_rt_schedulable+463]
>>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <NMI exception stack> ---
>>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to check
>> for rt_rq::rt_nr_running instead of iteration over task list.
>> This makes the function to scale well, and its execution time
>> does not depend on number of processes in the system.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before. Also, rt_rq::rt_nr_running contains queued
>> child cfs_rq in additional to queued task. Since tg_has_rt_tasks()
> 
> s/cfs_/rt_/ , right?
> 
>> is used in case of !runtime case:
>>
>> 	if (rt_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
>> 		return -EBUSY;
>>
>> the behaviour won't change. The only change is that walk_tg_tree()
>> calling tg_rt_schedulable() will break its iteration on parent cfs_rq,
> 
> Ditto.
> 
>> i.e. earlier.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>  kernel/sched/rt.c |   22 ++++++++++++----------
>>  1 file changed, 12 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 7aef6b4e885a..601151bb9322 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2395,10 +2395,10 @@ const struct sched_class rt_sched_class = {
>>   */
>>  static DEFINE_MUTEX(rt_constraints_mutex);
>>  
>> -/* Must be called with tasklist_lock held */
>>  static inline int tg_has_rt_tasks(struct task_group *tg)
>>  {
>> -	struct task_struct *g, *p;
>> +	struct rt_rq *rt_rq;
>> +	int cpu, ret = 0;
>>  
>>  	/*
>>  	 * Autogroups do not have RT tasks; see autogroup_create().
>> @@ -2406,12 +2406,18 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>>  	if (task_group_is_autogroup(tg))
>>  		return 0;
>>  
>> -	for_each_process_thread(g, p) {
>> -		if (rt_task(p) && task_group(p) == tg)
>> -			return 1;
>> +	preempt_disable();
>> +
>> +	for_each_online_cpu(cpu) {
>> +		rt_rq = tg->rt_rq[cpu];
>> +		if (READ_ONCE(rt_rq->rt_nr_running)) {
> 
> Isn't this however checking against the current (dynamic) number of
> runnable tasks/groups instead of the "static" group membership (which
> shouldn't be affected by a task running state)?

Ah, you are right. I forgot that rt_nr_running does not include sleeping tasks.
We need to check something else here. I'll try to find another way.
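
For example (a rough sketch of the accounting, assuming the usual RT
enqueue/dequeue paths), when an RT task blocks:

	dequeue_task(rq, p, flags)
	  -> dequeue_task_rt()
	    -> dequeue_rt_entity()
	      -> dec_rt_tasks()	/* rt_rq->rt_nr_running-- */

so a group whose only RT task is currently sleeping would look empty to
the rt_nr_running check, and the write would wrongly succeed.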

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20  9:43   ` Kirill Tkhai
@ 2018-04-20 10:06     ` Kirill Tkhai
  2018-04-20 14:11       ` Juri Lelli
                         ` (2 more replies)
  2018-04-20 10:58     ` [PATCH] sched/rt: Rework " Juri Lelli
  1 sibling, 3 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-20 10:06 UTC (permalink / raw)
  To: Juri Lelli, mingo, peterz; +Cc: linux-kernel

From: Kirill Tkhai <ktkhai@virtuozzo.com>

tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take
a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().

PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
 #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
 #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
 #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
 #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
    [exception RIP: tg_rt_schedulable+463]
    RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
    RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
    RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
    R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
    R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
 #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
 #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
 #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
 #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
 #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d

The patch reworks tg_has_rt_tasks() and makes it iterate over
the task group's process list instead of iterating over the whole
task list. This makes the function scale well and reduces its
execution time.

Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce new races compared
to what we had before.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/sched/rt.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..20be796d4e63 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-/* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
-	struct task_struct *g, *p;
+	struct css_task_iter it;
+	struct task_struct *task;
+	int ret = 0;
 
 	/*
 	 * Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
 	if (task_group_is_autogroup(tg))
 		return 0;
 
-	for_each_process_thread(g, p) {
-		if (rt_task(p) && task_group(p) == tg)
-			return 1;
-	}
+	css_task_iter_start(&tg->css, 0, &it);
+	while ((task = css_task_iter_next(&it)))
+		if (rt_task(task)) {
+			ret = 1;
+			break;
+		}
+	css_task_iter_end(&it);
 
-	return 0;
+	return ret;
 }
 
 struct rt_schedulable_data {
@@ -2510,7 +2514,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return -EINVAL;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
 		goto unlock;
@@ -2528,7 +2531,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	}
 	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
 unlock:
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return err;
@@ -2582,9 +2584,7 @@ static int sched_rt_global_constraints(void)
 	int ret = 0;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return ret;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20  9:43   ` Kirill Tkhai
  2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
@ 2018-04-20 10:58     ` Juri Lelli
  2018-04-20 11:21       ` Kirill Tkhai
  1 sibling, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2018-04-20 10:58 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: mingo, peterz, linux-kernel

On 20/04/18 12:43, Kirill Tkhai wrote:
> On 20.04.2018 12:25, Juri Lelli wrote:

[...]

> > Isn't this however checking against the current (dynamic) number of
> > runnable tasks/groups instead of the "static" group membership (which
> > shouldn't be affected by a task running state)?
> 
> Ah, you are sure. I forgot that rt_nr_running does not contain sleeping tasks.
> We need to check something else here. I'll try to find another way.

n/p. Maybe a per-rt_rq flag linked to "static" membership (I didn't
really think this through though :).

BTW, since you faced this problem, I guess this is on RT_GROUP_SCHED
enabled boxes, so I'd have a couple of questions (not strictly related
to the problem at hand):

 - do such boxes rely on RT throttling being performed at root level?
 - is RT_RUNTIME_SHARE enabled?

Thanks,

- Juri

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 10:58     ` [PATCH] sched/rt: Rework " Juri Lelli
@ 2018-04-20 11:21       ` Kirill Tkhai
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-20 11:21 UTC (permalink / raw)
  To: Juri Lelli; +Cc: mingo, peterz, linux-kernel

On 20.04.2018 13:58, Juri Lelli wrote:
> On 20/04/18 12:43, Kirill Tkhai wrote:
>> On 20.04.2018 12:25, Juri Lelli wrote:
> 
> [...]
> 
>>> Isn't this however checking against the current (dynamic) number of
>>> runnable tasks/groups instead of the "static" group membership (which
>>> shouldn't be affected by a task running state)?
>>
>> Ah, you are sure. I forgot that rt_nr_running does not contain sleeping tasks.
>> We need to check something else here. I'll try to find another way.
> 
> n/p. Maybe a per rt_rq flag linked to "static" membership (I didn't
> really thought this through though :).

sched_move_task() does not change any rt_rq fields when moving a dequeued
task, so we definitely can't use rt_rq to detect all the tasks.
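
Rough sketch of the move path for a sleeping task (simplified, details
elided):

	sched_move_task(p)
	  queued = task_on_rq_queued(p);	/* false for a sleeping task */
	  if (queued)
		dequeue_task(rq, p, ...);	/* skipped */
	  set_task_rq(p, task_cpu(p));		/* only repoints p->rt.rt_rq etc. */
	  if (queued)
		enqueue_task(rq, p, ...);	/* skipped, rt_nr_running untouched */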

> BTW, since you faced this problem, I guess this is on RT_GROUP_SCHED
> enabled boxes, so I'd have a couple of questions (not strictly related
> to the problem at hand):
> 
>  - do such boxes rely on RT throttling being performed at root level?
>  - is RT_RUNTIME_SHARE enabled?

This is a machine with many fair_sched_class tasks, while there are
(almost) no RT tasks there, and there is no "real" real-time with throttling.
They are not interesting from this point of view. Only a very small number
of task groups may have rt_runtime != 0 there, and where they do, the
rt_runtime value is small. The problem I'm solving happened when the
rt_period of one of such groups was restored, and it hung for ages because
of the enormous number of child cgroups and tasks (of fair_sched_class)
in the system.

RT_RUNTIME_SHARE is not enabled, as it's an RH7-based 3.10 kernel.

Really, it will be very difficult to provide real-time on machines with
such a big number of tasks, in case the tasks allocate some resources
(not all memory is pinned from going to slab, there is some IO, etc.),
since there are some tasks to solve there even in the !rt case.

Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
@ 2018-04-20 14:11       ` Juri Lelli
  2018-04-20 14:30         ` Kirill Tkhai
  2018-04-25 15:42       ` Kirill Tkhai
  2018-04-25 19:49       ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Juri Lelli @ 2018-04-20 14:11 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: mingo, peterz, linux-kernel

On 20/04/18 13:06, Kirill Tkhai wrote:
> From: Kirill Tkhai <ktkhai@virtuozzo.com>
> 
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
> 
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
> 
> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>     [exception RIP: tg_rt_schedulable+463]
>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> 
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
> 
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.

This seems to be true. However, I wonder why we are OK with the current racy
code (against tasks moving between groups). :/

Can't a task join the group while we are iterating, so that we miss it?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 14:11       ` Juri Lelli
@ 2018-04-20 14:30         ` Kirill Tkhai
  2018-04-20 15:27           ` Juri Lelli
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-20 14:30 UTC (permalink / raw)
  To: Juri Lelli; +Cc: mingo, peterz, linux-kernel

On 20.04.2018 17:11, Juri Lelli wrote:
> On 20/04/18 13:06, Kirill Tkhai wrote:
>> From: Kirill Tkhai <ktkhai@virtuozzo.com>
>>
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>>     [exception RIP: tg_rt_schedulable+463]
>>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <NMI exception stack> ---
>>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to iterate over
>> task group process list instead of iteration over all tasks list.
>> This makes the function to scale well, and reduces its execution
>> time.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before.
> 
> This seems to be true. However, I wonder why we are OK with current racy
> code (against tasks moving between groups). :/
> 
> Can't a task join the group while we are iterating and we miss that?

Yes, it can, but I'm not sure this should be considered a problem,
given the racy design we already have. It's not a real protection; this
place exists to warn a person that they are doing something wrong. We check
for zero being written there, but a written "1" would raise the same problems.

There are cgroup_threadgroup_change_begin(task)/cgroup_threadgroup_change_end(task)
to protect against cgroup changes, but it seems we can't use them, as they take
a task argument and are aimed at protecting a single task's thread group in the
future (despite taking a global percpu_rwsem).

Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 14:30         ` Kirill Tkhai
@ 2018-04-20 15:27           ` Juri Lelli
  0 siblings, 0 replies; 21+ messages in thread
From: Juri Lelli @ 2018-04-20 15:27 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: mingo, peterz, linux-kernel

On 20/04/18 17:30, Kirill Tkhai wrote:
> On 20.04.2018 17:11, Juri Lelli wrote:
> > On 20/04/18 13:06, Kirill Tkhai wrote:
> >> From: Kirill Tkhai <ktkhai@virtuozzo.com>
> >>
> >> tg_rt_schedulable() iterates over all child task groups,
> >> while tg_has_rt_tasks() iterates over all linked tasks.
> >> In case of systems with big number of tasks, this may
> >> take a lot of time.
> >>
> >> I observed hard LOCKUP on machine with 20000+ processes
> >> after write to "cpu.rt_period_us" of cpu cgroup with
> >> 39 children. The problem occurred because of tasklist_lock
> >> is held for a long time and other processes can't do fork().
> >>
> >> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
> >>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> >>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> >>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> >>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> >>     [exception RIP: tg_rt_schedulable+463]
> >>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
> >>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
> >>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
> >>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
> >>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
> >>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
> >>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >> --- <NMI exception stack> ---
> >>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> >>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> >>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> >>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> >>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> >>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> >> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> >> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> >>
> >> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> >> task group process list instead of iteration over all tasks list.
> >> This makes the function to scale well, and reduces its execution
> >> time.
> >>
> >> Note, that since tasklist_lock doesn't protect a task against
> >> sched_class changing, we don't introduce new races in comparison
> >> to that we had before.
> > 
> > This seems to be true. However, I wonder why we are OK with current racy
> > code (against tasks moving between groups). :/
> > 
> > Can't a task join the group while we are iterating and we miss that?
> 
> Yes, it can, but I'm not sure either this should be considered as problem,
> seeing the race design we already have. It's not a real protection, this
> place is to warn a person, he does something wrong. We check for zero written
> there, but really written "1" will invent the same problems.

Mmm, right. :/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
  2018-04-20 14:11       ` Juri Lelli
@ 2018-04-25 15:42       ` Kirill Tkhai
  2018-04-25 19:49       ` Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-25 15:42 UTC (permalink / raw)
  To: Juri Lelli, mingo, peterz; +Cc: linux-kernel

Peter,

could you please give some comments on this?

Kirill

On 20.04.2018 13:06, Kirill Tkhai wrote:
> From: Kirill Tkhai <ktkhai@virtuozzo.com>
> 
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
> 
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
> 
> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>     [exception RIP: tg_rt_schedulable+463]
>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> 
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
> 
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  kernel/sched/rt.c |   22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..20be796d4e63 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
>   */
>  static DEFINE_MUTEX(rt_constraints_mutex);
>  
> -/* Must be called with tasklist_lock held */
>  static inline int tg_has_rt_tasks(struct task_group *tg)
>  {
> -	struct task_struct *g, *p;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	int ret = 0;
>  
>  	/*
>  	 * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>  	if (task_group_is_autogroup(tg))
>  		return 0;
>  
> -	for_each_process_thread(g, p) {
> -		if (rt_task(p) && task_group(p) == tg)
> -			return 1;
> -	}
> +	css_task_iter_start(&tg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it)))
> +		if (rt_task(task)) {
> +			ret = 1;
> +			break;
> +		}
> +	css_task_iter_end(&it);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  struct rt_schedulable_data {
> @@ -2510,7 +2514,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>  		return -EINVAL;
>  
>  	mutex_lock(&rt_constraints_mutex);
> -	read_lock(&tasklist_lock);
>  	err = __rt_schedulable(tg, rt_period, rt_runtime);
>  	if (err)
>  		goto unlock;
> @@ -2528,7 +2531,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>  	}
>  	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
>  unlock:
> -	read_unlock(&tasklist_lock);
>  	mutex_unlock(&rt_constraints_mutex);
>  
>  	return err;
> @@ -2582,9 +2584,7 @@ static int sched_rt_global_constraints(void)
>  	int ret = 0;
>  
>  	mutex_lock(&rt_constraints_mutex);
> -	read_lock(&tasklist_lock);
>  	ret = __rt_schedulable(NULL, 0, 0);
> -	read_unlock(&tasklist_lock);
>  	mutex_unlock(&rt_constraints_mutex);
>  
>  	return ret;
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-19 17:29 [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks() Kirill Tkhai
  2018-04-20  9:25 ` Juri Lelli
@ 2018-04-25 17:55 ` Peter Zijlstra
  2018-04-26  9:26   ` Kirill Tkhai
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-04-25 17:55 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: mingo, linux-kernel, Juri Lelli

On Thu, Apr 19, 2018 at 08:29:01PM +0300, Kirill Tkhai wrote:
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.

So you're actually using RT cgroups?

Some of us recently considered removing that thing entirely because most
distros don't actually use it at all -- and it's broken from an RT POV.

That would then clear the slate to try and implement something new.

But if you're actually using this stuff, that would complicate matters.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
  2018-04-20 14:11       ` Juri Lelli
  2018-04-25 15:42       ` Kirill Tkhai
@ 2018-04-25 19:49       ` Peter Zijlstra
  2018-04-26  9:54         ` [PATCH v3]sched/rt: Stop " Kirill Tkhai
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2018-04-25 19:49 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Juri Lelli, mingo, linux-kernel

On Fri, Apr 20, 2018 at 01:06:31PM +0300, Kirill Tkhai wrote:
> @@ -2406,12 +2407,15 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>  	if (task_group_is_autogroup(tg))
>  		return 0;
>  
> -	for_each_process_thread(g, p) {
> -		if (rt_task(p) && task_group(p) == tg)
> -			return 1;
> -	}
> +	css_task_iter_start(&tg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it)))
> +		if (rt_task(task)) {
> +			ret = 1;
> +			break;
> +		}

Aside from the missing {} there, the patch looks OK I suppose.

The races found seem somewhat dodgy, but as you argue, they're not in
fact new :/


> +	css_task_iter_end(&it);
>  
> -	return 0;
> +	return ret;
>  }
>  

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-25 17:55 ` Peter Zijlstra
@ 2018-04-26  9:26   ` Kirill Tkhai
  0 siblings, 0 replies; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-26  9:26 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, Juri Lelli

On 25.04.2018 20:55, Peter Zijlstra wrote:
> On Thu, Apr 19, 2018 at 08:29:01PM +0300, Kirill Tkhai wrote:
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
> 
> So you're actually using RT cgroups?

I myself don't use them, but this comes from the CRIU project. We try to dump and restore
every configuration that people may use. Sometimes they configure the RT parameters of a cpu
cgroup, and criu has to restore them as they were at the moment of dump. So, this is the place
where we met the hard lockup. I can't say what forces people to use nested cpu cgroups
for RT...

> Some of us recently considered removing that thing entirely because most
> distro's don't actually use it at all -- and it's broken from a RT POV.
> 
> That would then clear the slate to try and implement something new.
> 
> But if you're actually using this stuff, that would complicate matters.

From the point of view of the criu project, if you delete them, this will decrease
the number of things we need to support, so I'm not against that :D

Seriously speaking, it's a surprise to me that they are broken. It seemed
everything was fine from an RT POV: throttling, boosting, PI...
Kirill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-25 19:49       ` Peter Zijlstra
@ 2018-04-26  9:54         ` Kirill Tkhai
  2020-01-23 21:56           ` Phil Auld
  0 siblings, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2018-04-26  9:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Juri Lelli, mingo, linux-kernel

From: Kirill Tkhai <ktkhai@virtuozzo.com>

tg_rt_schedulable() iterates over all child task groups,
while tg_has_rt_tasks() iterates over all linked tasks.
On systems with a big number of tasks, this may take
a lot of time.

I observed a hard LOCKUP on a machine with 20000+ processes
after a write to "cpu.rt_period_us" of a cpu cgroup with
39 children. The problem occurred because tasklist_lock
is held for a long time and other processes can't fork().

PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
 #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
 #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
 #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
 #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
    [exception RIP: tg_rt_schedulable+463]
    RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
    RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
    RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
    R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
    R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
 #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
 #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
 #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
 #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
 #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
#10 [ffff886537ad7f08] sys_write at ffffffff8121faff
#11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d

The patch reworks tg_has_rt_tasks() and makes it iterate over
the task group's process list instead of iterating over the whole
task list. This makes the function scale well and reduces its
execution time.

Note that since tasklist_lock doesn't protect a task against
a sched_class change, we don't introduce new races compared
to what we had before.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 kernel/sched/rt.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7aef6b4e885a..a40535c604b8 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
  */
 static DEFINE_MUTEX(rt_constraints_mutex);
 
-/* Must be called with tasklist_lock held */
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
-	struct task_struct *g, *p;
+	struct task_struct *task;
+	struct css_task_iter it;
+	int ret = 0;
 
 	/*
 	 * Autogroups do not have RT tasks; see autogroup_create().
@@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
 	if (task_group_is_autogroup(tg))
 		return 0;
 
-	for_each_process_thread(g, p) {
-		if (rt_task(p) && task_group(p) == tg)
-			return 1;
+	css_task_iter_start(&tg->css, 0, &it);
+	while ((task = css_task_iter_next(&it))) {
+		if (rt_task(task)) {
+			ret = 1;
+			break;
+		}
 	}
+	css_task_iter_end(&it);
 
-	return 0;
+	return ret;
 }
 
 struct rt_schedulable_data {
@@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return -EINVAL;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
 		goto unlock;
@@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	}
 	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
 unlock:
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return err;
@@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
 	int ret = 0;
 
 	mutex_lock(&rt_constraints_mutex);
-	read_lock(&tasklist_lock);
 	ret = __rt_schedulable(NULL, 0, 0);
-	read_unlock(&tasklist_lock);
 	mutex_unlock(&rt_constraints_mutex);
 
 	return ret;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2018-04-26  9:54         ` [PATCH v3]sched/rt: Stop " Kirill Tkhai
@ 2020-01-23 21:56           ` Phil Auld
  2020-01-24  9:09             ` Kirill Tkhai
  2020-01-27 16:43             ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Phil Auld @ 2020-01-23 21:56 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Peter Zijlstra, Juri Lelli, mingo, linux-kernel

Hi,

On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
> From: Kirill Tkhai <ktkhai@virtuozzo.com>
> 
> tg_rt_schedulable() iterates over all child task groups,
> while tg_has_rt_tasks() iterates over all linked tasks.
> In case of systems with big number of tasks, this may
> take a lot of time.
> 
> I observed hard LOCKUP on machine with 20000+ processes
> after write to "cpu.rt_period_us" of cpu cgroup with
> 39 children. The problem occurred because of tasklist_lock
> is held for a long time and other processes can't do fork().
> 
> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>     [exception RIP: tg_rt_schedulable+463]
>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> 
> The patch reworks tg_has_rt_tasks() and makes it to iterate over
> task group process list instead of iteration over all tasks list.
> This makes the function to scale well, and reduces its execution
> time.
> 
> Note, that since tasklist_lock doesn't protect a task against
> sched_class changing, we don't introduce new races in comparison
> to that we had before.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
>  kernel/sched/rt.c |   21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7aef6b4e885a..a40535c604b8 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
>   */
>  static DEFINE_MUTEX(rt_constraints_mutex);
>  
> -/* Must be called with tasklist_lock held */
>  static inline int tg_has_rt_tasks(struct task_group *tg)
>  {
> -	struct task_struct *g, *p;
> +	struct task_struct *task;
> +	struct css_task_iter it;
> +	int ret = 0;
>  
>  	/*
>  	 * Autogroups do not have RT tasks; see autogroup_create().
> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>  	if (task_group_is_autogroup(tg))
>  		return 0;
>  
> -	for_each_process_thread(g, p) {
> -		if (rt_task(p) && task_group(p) == tg)
> -			return 1;
> +	css_task_iter_start(&tg->css, 0, &it);
> +	while ((task = css_task_iter_next(&it))) {
> +		if (rt_task(task)) {
> +			ret = 1;
> +			break;
> +		}
>  	}
> +	css_task_iter_end(&it);
>  
> -	return 0;
> +	return ret;
>  }
>  
>  struct rt_schedulable_data {
> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>  		return -EINVAL;
>  
>  	mutex_lock(&rt_constraints_mutex);
> -	read_lock(&tasklist_lock);
>  	err = __rt_schedulable(tg, rt_period, rt_runtime);
>  	if (err)
>  		goto unlock;
> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>  	}
>  	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
>  unlock:
> -	read_unlock(&tasklist_lock);
>  	mutex_unlock(&rt_constraints_mutex);
>  
>  	return err;
> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
>  	int ret = 0;
>  
>  	mutex_lock(&rt_constraints_mutex);
> -	read_lock(&tasklist_lock);
>  	ret = __rt_schedulable(NULL, 0, 0);
> -	read_unlock(&tasklist_lock);
>  	mutex_unlock(&rt_constraints_mutex);
>  
>  	return ret;

Sorry to necro this...  

I have a customer case that has hit the issue here. It looks like this
fix didn't end up going in. 

The only way I could reproduce it myself was to add a udelay(2) to 
the tg_has_rt_tasks loop. With this patch applied, even with the 
udelay nothing bad happens.
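
For reference, the debug-only hack was along these lines (just a sketch
to inflate the per-task cost of the loop, not something to merge):

	for_each_process_thread(g, p) {
		udelay(2);	/* artificial delay to widen the window */
		if (rt_task(p) && task_group(p) == tg)
			return 1;
	}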

Kirill, did you have a good way to reproduce it without adding 
delays? 

Peter, is there any chance of taking something like this?

Is there a better fix? 


Thanks,
Phil

-- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-23 21:56           ` Phil Auld
@ 2020-01-24  9:09             ` Kirill Tkhai
  2020-01-27 16:30               ` Phil Auld
  2020-01-27 16:43             ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Kirill Tkhai @ 2020-01-24  9:09 UTC (permalink / raw)
  To: Phil Auld; +Cc: Peter Zijlstra, Juri Lelli, mingo, linux-kernel

On 24.01.2020 00:56, Phil Auld wrote:
> Hi,
> 
> On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
>> From: Kirill Tkhai <ktkhai@virtuozzo.com>
>>
>> tg_rt_schedulable() iterates over all child task groups,
>> while tg_has_rt_tasks() iterates over all linked tasks.
>> In case of systems with big number of tasks, this may
>> take a lot of time.
>>
>> I observed hard LOCKUP on machine with 20000+ processes
>> after write to "cpu.rt_period_us" of cpu cgroup with
>> 39 children. The problem occurred because of tasklist_lock
>> is held for a long time and other processes can't do fork().
>>
>> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
>>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
>>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
>>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
>>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
>>     [exception RIP: tg_rt_schedulable+463]
>>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
>>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
>>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
>>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
>>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
>>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
>>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> --- <NMI exception stack> ---
>>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
>>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
>>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
>>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
>>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
>>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
>> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
>> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
>>
>> The patch reworks tg_has_rt_tasks() and makes it to iterate over
>> task group process list instead of iteration over all tasks list.
>> This makes the function to scale well, and reduces its execution
>> time.
>>
>> Note, that since tasklist_lock doesn't protect a task against
>> sched_class changing, we don't introduce new races in comparison
>> to that we had before.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>  kernel/sched/rt.c |   21 +++++++++++----------
>>  1 file changed, 11 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
>> index 7aef6b4e885a..a40535c604b8 100644
>> --- a/kernel/sched/rt.c
>> +++ b/kernel/sched/rt.c
>> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
>>   */
>>  static DEFINE_MUTEX(rt_constraints_mutex);
>>  
>> -/* Must be called with tasklist_lock held */
>>  static inline int tg_has_rt_tasks(struct task_group *tg)
>>  {
>> -	struct task_struct *g, *p;
>> +	struct task_struct *task;
>> +	struct css_task_iter it;
>> +	int ret = 0;
>>  
>>  	/*
>>  	 * Autogroups do not have RT tasks; see autogroup_create().
>> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
>>  	if (task_group_is_autogroup(tg))
>>  		return 0;
>>  
>> -	for_each_process_thread(g, p) {
>> -		if (rt_task(p) && task_group(p) == tg)
>> -			return 1;
>> +	css_task_iter_start(&tg->css, 0, &it);
>> +	while ((task = css_task_iter_next(&it))) {
>> +		if (rt_task(task)) {
>> +			ret = 1;
>> +			break;
>> +		}
>>  	}
>> +	css_task_iter_end(&it);
>>  
>> -	return 0;
>> +	return ret;
>>  }
>>  
>>  struct rt_schedulable_data {
>> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>>  		return -EINVAL;
>>  
>>  	mutex_lock(&rt_constraints_mutex);
>> -	read_lock(&tasklist_lock);
>>  	err = __rt_schedulable(tg, rt_period, rt_runtime);
>>  	if (err)
>>  		goto unlock;
>> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
>>  	}
>>  	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
>>  unlock:
>> -	read_unlock(&tasklist_lock);
>>  	mutex_unlock(&rt_constraints_mutex);
>>  
>>  	return err;
>> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
>>  	int ret = 0;
>>  
>>  	mutex_lock(&rt_constraints_mutex);
>> -	read_lock(&tasklist_lock);
>>  	ret = __rt_schedulable(NULL, 0, 0);
>> -	read_unlock(&tasklist_lock);
>>  	mutex_unlock(&rt_constraints_mutex);
>>  
>>  	return ret;
> 
> Sorry to necro this...  
> 
> I have a customer case that has hit the issue here. It looks like this
> fix didn't end up going in. 
> 
> The only way I could reproduce it myself was to add a udelay(2) to 
> the tg_has_rt_tasks loop. With this patch applied, even with the 
> udelay nothing bad happens.
> 
> Kirill, did you have a good way to reproduce it without adding 
> delays? 

I have no reproducer; it just used to fire on some of our customers' nodes.
It depends on many parameters, and the main one is the workload. Also, I think
interrupt affinity may be involved (whether interrupts are bound to a small subset
of cpus, or distributed over all cores). If this is possible theoretically, it
occurs in practice with some probability.

We fixed this with an out-of-tree patch in our vzkernel tree, so we haven't seen
a reproduction anymore.

> Peter, is there any chance of taking something like this?
> 
> Is there a better fix? 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-24  9:09             ` Kirill Tkhai
@ 2020-01-27 16:30               ` Phil Auld
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Auld @ 2020-01-27 16:30 UTC (permalink / raw)
  To: Kirill Tkhai; +Cc: Peter Zijlstra, Juri Lelli, mingo, linux-kernel

On Fri, Jan 24, 2020 at 12:09:20PM +0300 Kirill Tkhai wrote:
> On 24.01.2020 00:56, Phil Auld wrote:
> > Hi,
> > 
> > On Thu, Apr 26, 2018 at 12:54:41PM +0300 Kirill Tkhai wrote:
> >> From: Kirill Tkhai <ktkhai@virtuozzo.com>
> >>
> >> tg_rt_schedulable() iterates over all child task groups,
> >> while tg_has_rt_tasks() iterates over all linked tasks.
> >> In case of systems with big number of tasks, this may
> >> take a lot of time.
> >>
> >> I observed hard LOCKUP on machine with 20000+ processes
> >> after write to "cpu.rt_period_us" of cpu cgroup with
> >> 39 children. The problem occurred because of tasklist_lock
> >> is held for a long time and other processes can't do fork().
> >>
> >> PID: 1036268  TASK: ffff88766c310000  CPU: 36  COMMAND: "criu"
> >>  #0 [ffff887f7f408e48] crash_nmi_callback at ffffffff81050601
> >>  #1 [ffff887f7f408e58] nmi_handle at ffffffff816e0cc7
> >>  #2 [ffff887f7f408eb0] do_nmi at ffffffff816e0fb0
> >>  #3 [ffff887f7f408ef0] end_repeat_nmi at ffffffff816e00b9
> >>     [exception RIP: tg_rt_schedulable+463]
> >>     RIP: ffffffff810bf49f  RSP: ffff886537ad7d50  RFLAGS: 00000202
> >>     RAX: 0000000000000000  RBX: 000000003b9aca00  RCX: ffff883e9cb4b1b0
> >>     RDX: ffff887d0be43608  RSI: ffff886537ad7dd8  RDI: ffff8840a6ad0000
> >>     RBP: ffff886537ad7d68   R8: ffff887d0be431b0   R9: 00000000000e7ef0
> >>     R10: ffff88164fc39400  R11: 0000000000023380  R12: ffffffff81ef8d00
> >>     R13: ffffffff810bea40  R14: 0000000000000000  R15: ffff8840a6ad0000
> >>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >> --- <NMI exception stack> ---
> >>  #4 [ffff886537ad7d50] tg_rt_schedulable at ffffffff810bf49f
> >>  #5 [ffff886537ad7d70] walk_tg_tree_from at ffffffff810c6c91
> >>  #6 [ffff886537ad7dc0] tg_set_rt_bandwidth at ffffffff810c6dd0
> >>  #7 [ffff886537ad7e28] cpu_rt_period_write_uint at ffffffff810c6eea
> >>  #8 [ffff886537ad7e38] cgroup_file_write at ffffffff8111cfd3
> >>  #9 [ffff886537ad7ec8] vfs_write at ffffffff8121eced
> >> #10 [ffff886537ad7f08] sys_write at ffffffff8121faff
> >> #11 [ffff886537ad7f50] system_call_fastpath at ffffffff816e8a7d
> >>
> >> The patch reworks tg_has_rt_tasks() and makes it iterate over the
> >> task group's process list instead of over the whole task list.
> >> This makes the function scale well and reduces its execution
> >> time.
> >>
> >> Note that since tasklist_lock doesn't protect a task against
> >> sched_class changes, we don't introduce any new races compared
> >> to what we had before.
> >>
> >> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> >> ---
> >>  kernel/sched/rt.c |   21 +++++++++++----------
> >>  1 file changed, 11 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> >> index 7aef6b4e885a..a40535c604b8 100644
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -2395,10 +2395,11 @@ const struct sched_class rt_sched_class = {
> >>   */
> >>  static DEFINE_MUTEX(rt_constraints_mutex);
> >>  
> >> -/* Must be called with tasklist_lock held */
> >>  static inline int tg_has_rt_tasks(struct task_group *tg)
> >>  {
> >> -	struct task_struct *g, *p;
> >> +	struct task_struct *task;
> >> +	struct css_task_iter it;
> >> +	int ret = 0;
> >>  
> >>  	/*
> >>  	 * Autogroups do not have RT tasks; see autogroup_create().
> >> @@ -2406,12 +2407,16 @@ static inline int tg_has_rt_tasks(struct task_group *tg)
> >>  	if (task_group_is_autogroup(tg))
> >>  		return 0;
> >>  
> >> -	for_each_process_thread(g, p) {
> >> -		if (rt_task(p) && task_group(p) == tg)
> >> -			return 1;
> >> +	css_task_iter_start(&tg->css, 0, &it);
> >> +	while ((task = css_task_iter_next(&it))) {
> >> +		if (rt_task(task)) {
> >> +			ret = 1;
> >> +			break;
> >> +		}
> >>  	}
> >> +	css_task_iter_end(&it);
> >>  
> >> -	return 0;
> >> +	return ret;
> >>  }
> >>  
> >>  struct rt_schedulable_data {
> >> @@ -2510,7 +2515,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> >>  		return -EINVAL;
> >>  
> >>  	mutex_lock(&rt_constraints_mutex);
> >> -	read_lock(&tasklist_lock);
> >>  	err = __rt_schedulable(tg, rt_period, rt_runtime);
> >>  	if (err)
> >>  		goto unlock;
> >> @@ -2528,7 +2532,6 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
> >>  	}
> >>  	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
> >>  unlock:
> >> -	read_unlock(&tasklist_lock);
> >>  	mutex_unlock(&rt_constraints_mutex);
> >>  
> >>  	return err;
> >> @@ -2582,9 +2585,7 @@ static int sched_rt_global_constraints(void)
> >>  	int ret = 0;
> >>  
> >>  	mutex_lock(&rt_constraints_mutex);
> >> -	read_lock(&tasklist_lock);
> >>  	ret = __rt_schedulable(NULL, 0, 0);
> >> -	read_unlock(&tasklist_lock);
> >>  	mutex_unlock(&rt_constraints_mutex);
> >>  
> >>  	return ret;
> > 
> > Sorry to necro this...  
> > 
> > I have a customer case that has hit the issue here. It looks like this
> > fix didn't end up going in. 
> > 
> > The only way I could reproduce it myself was to add a udelay(2) to 
> > the tg_has_rt_tasks loop. With this patch applied, even with the 
> > udelay nothing bad happens.
> > 
> > Kirill, did you have a good way to reproduce it without adding 
> > delays? 
> 
> I don't have a reproducer; it just used to fire on some of our customers'
> nodes. It depends on many parameters, and the main one is the workload. Also,
> I think interrupt affinity may be involved (whether interrupts are bound to a
> small subset of cpus or distributed over all cores). If it is possible
> theoretically, it occurs in practice with some probability.
> 
> We fixed this with an out-of-tree patch in our vzkernel tree, so we haven't
> seen it reproduce since.
> 

Thanks Kirill. 


Fwiw, it looks like Konstantin Khlebnikov has picked up this change in 
a newer patch, so if that can get in then this would be cleared up too.

https://lkml.org/lkml/2020/1/25/167


Cheers,
Phil


> > Peter, is there any chance of taking something like this?
> > 
> > Is there a better fix? 
> 

-- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-23 21:56           ` Phil Auld
  2020-01-24  9:09             ` Kirill Tkhai
@ 2020-01-27 16:43             ` Peter Zijlstra
  2020-01-27 16:56               ` Phil Auld
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-01-27 16:43 UTC (permalink / raw)
  To: Phil Auld; +Cc: Kirill Tkhai, Juri Lelli, mingo, linux-kernel

On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> Peter, is there any chance of taking something like this?

Whoopsy, looks like this fell on the floor. Can do I suppose.

Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-27 16:43             ` Peter Zijlstra
@ 2020-01-27 16:56               ` Phil Auld
  2020-01-27 17:00                 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Phil Auld @ 2020-01-27 16:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kirill Tkhai, Juri Lelli, mingo, linux-kernel

On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > Peter, is there any chance of taking something like this?
> 
> Whoopsy, looks like this fell on the floor. Can do I suppose.
> 
> Thanks!
> 

Thanks. Probably makes sense at this point to use Konstantin's new version?
But they both do the trick.


Cheers,
Phil

-- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-27 16:56               ` Phil Auld
@ 2020-01-27 17:00                 ` Peter Zijlstra
  2020-01-27 17:45                   ` Phil Auld
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2020-01-27 17:00 UTC (permalink / raw)
  To: Phil Auld; +Cc: Kirill Tkhai, Juri Lelli, mingo, linux-kernel

On Mon, Jan 27, 2020 at 11:56:38AM -0500, Phil Auld wrote:
> On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> > On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > > Peter, is there any chance of taking something like this?
> > 
> > Whoopsy, looks like this fell on the floor. Can do I suppose.
> > 
> > Thanks!
> > 
> 
> > Thanks. Probably makes sense at this point to use Konstantin's new version?
> But they both do the trick.

Care to send it along?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3]sched/rt: Stop for_each_process_thread() iterations in tg_has_rt_tasks()
  2020-01-27 17:00                 ` Peter Zijlstra
@ 2020-01-27 17:45                   ` Phil Auld
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Auld @ 2020-01-27 17:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Kirill Tkhai, Juri Lelli, mingo, linux-kernel

On Mon, Jan 27, 2020 at 06:00:10PM +0100 Peter Zijlstra wrote:
> On Mon, Jan 27, 2020 at 11:56:38AM -0500, Phil Auld wrote:
> > On Mon, Jan 27, 2020 at 05:43:15PM +0100 Peter Zijlstra wrote:
> > > On Thu, Jan 23, 2020 at 04:56:19PM -0500, Phil Auld wrote:
> > > > Peter, is there any chance of taking something like this?
> > > 
> > > Whoopsy, looks like this fell on the floor. Can do I suppose.
> > > 
> > > Thanks!
> > > 
> > 
> > > Thanks. Probably makes sense at this point to use Konstantin's new version?
> > But they both do the trick.
> 
> Care to send it along?
> 

I think you were on the CC, but here's a link:

https://lkml.org/lkml/2020/1/25/167

It's called: 

[PATCH] sched/rt: optimize checking group rt scheduler constraints

It does a little bit more and needs an "|=" changed to "=". And actually
looking at it again, it should probably just have a break in the loop.

I'll make that comment in the thread for it.
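
As a sketch only -- Konstantin's patch itself is not quoted in this thread,
so the exact code may differ -- the difference being discussed, assuming the
css_task_iter walk from the v3 patch above, is roughly:

	/* Accumulating form: keeps walking even after an RT task is found. */
	while ((task = css_task_iter_next(&it)))
		ret |= rt_task(task);

	/* Early-exit form: stop the cgroup task walk at the first RT task. */
	while ((task = css_task_iter_next(&it))) {
		if (rt_task(task)) {
			ret = 1;
			break;
		}
	}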


Thanks,
Phil

-- 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-01-27 17:46 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-19 17:29 [PATCH] sched/rt: Rework for_each_process_thread() iterations in tg_has_rt_tasks() Kirill Tkhai
2018-04-20  9:25 ` Juri Lelli
2018-04-20  9:43   ` Kirill Tkhai
2018-04-20 10:06     ` [PATCH v2] " Kirill Tkhai
2018-04-20 14:11       ` Juri Lelli
2018-04-20 14:30         ` Kirill Tkhai
2018-04-20 15:27           ` Juri Lelli
2018-04-25 15:42       ` Kirill Tkhai
2018-04-25 19:49       ` Peter Zijlstra
2018-04-26  9:54         ` [PATCH v3]sched/rt: Stop " Kirill Tkhai
2020-01-23 21:56           ` Phil Auld
2020-01-24  9:09             ` Kirill Tkhai
2020-01-27 16:30               ` Phil Auld
2020-01-27 16:43             ` Peter Zijlstra
2020-01-27 16:56               ` Phil Auld
2020-01-27 17:00                 ` Peter Zijlstra
2020-01-27 17:45                   ` Phil Auld
2018-04-20 10:58     ` [PATCH] sched/rt: Rework " Juri Lelli
2018-04-20 11:21       ` Kirill Tkhai
2018-04-25 17:55 ` Peter Zijlstra
2018-04-26  9:26   ` Kirill Tkhai
