All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] sched/fair: update_curr changed sum_exec_runtime to 1 when sum_exec_runtime is 0 beacuse some kernel code use sum_exec_runtime==0 to test task just be forked.
@ 2019-08-26 11:46 QiaoChong
  2019-08-27 14:36 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: QiaoChong @ 2019-08-26 11:46 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: linux-kernel, QiaoChong

From: Chong Qiao <qiaochong@loongson.cn>

Such as:
cpu_cgroup_attach>
 sched_move_task>
  task_change_group_fair>
   task_move_group_fair>
    detach_task_cfs_rq>
     vruntime_normalized>

	/*
	 * When !on_rq, vruntime of the task has usually NOT been normalized.
	 * But there are some cases where it has already been normalized:
	 *
	 * - A forked child which is waiting for being woken up by
	 *   wake_up_new_task().
	 * - A task which has been woken up by try_to_wake_up() and
	 *   waiting for actually being woken up by sched_ttwu_pending().
	 */
	if (!se->sum_exec_runtime ||
	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
		return true;

p->se.sum_exec_runtime is 0, does not mean task not been run (A forked child which is waiting for being woken up by  wake_up_new_task()).

Task may have been scheduled multimes, but p->se.sum_exec_runtime is still 0, because delta_exec maybe 0 in update_curr.

static void update_curr(struct cfs_rq *cfs_rq)
{
...
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;
...

	curr->sum_exec_runtime += delta_exec;
...
}

Task has been run and is stopped(on_rq == 0), vruntime not been normalized, but se->sum_exec_runtime == 0.
This cause vruntime_normalized set on_rq 1, and does not normalize vruntime.
This may cause task use old vruntime in old cgroup, which maybe very large than task's vruntime in new cgroup.
Which may cause task may not scheduled in run queue for long time after been waked up.

Now I change sum_exec_runtime to 1 when sum_exec_runtime == 0 in update_curr to make sun_exec_runtime not 0.

Signed-off-by: Chong Qiao <qiaochong@loongson.cn>
---
 kernel/sched/fair.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd2..c4cd6249d9d2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -841,8 +841,15 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		return;
 
 	delta_exec = now - curr->exec_start;
-	if (unlikely((s64)delta_exec <= 0))
+	if (unlikely((s64)delta_exec <= 0)) {
+		/*
+		 * se.sum_exec_runtime == 0 means task is forked and wait to be run.
+		 * So here change se.sum_exec_runtime to 1, to make it not 0.
+		 */
+		if (!curr->sum_exec_runtime)
+			curr->sum_exec_runtime++;
 		return;
+	}
 
 	curr->exec_start = now;
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched/fair: update_curr changed sum_exec_runtime to 1 when sum_exec_runtime is 0 beacuse some kernel code use sum_exec_runtime==0 to test task just be forked.
  2019-08-26 11:46 [PATCH] sched/fair: update_curr changed sum_exec_runtime to 1 when sum_exec_runtime is 0 beacuse some kernel code use sum_exec_runtime==0 to test task just be forked QiaoChong
@ 2019-08-27 14:36 ` Peter Zijlstra
  2019-08-28  5:02   ` 乔崇
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2019-08-27 14:36 UTC (permalink / raw)
  To: QiaoChong; +Cc: Ingo Molnar, linux-kernel

On Mon, Aug 26, 2019 at 07:46:50PM +0800, QiaoChong wrote:
> From: Chong Qiao <qiaochong@loongson.cn>
> 
> Such as:
> cpu_cgroup_attach>
>  sched_move_task>
>   task_change_group_fair>
>    task_move_group_fair>
>     detach_task_cfs_rq>
>      vruntime_normalized>
> 
> 	/*
> 	 * When !on_rq, vruntime of the task has usually NOT been normalized.
> 	 * But there are some cases where it has already been normalized:
> 	 *
> 	 * - A forked child which is waiting for being woken up by
> 	 *   wake_up_new_task().
> 	 * - A task which has been woken up by try_to_wake_up() and
> 	 *   waiting for actually being woken up by sched_ttwu_pending().
> 	 */
> 	if (!se->sum_exec_runtime ||
> 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
> 		return true;
> 
> p->se.sum_exec_runtime is 0, does not mean task not been run (A forked child which is waiting for being woken up by  wake_up_new_task()).
> 
> Task may have been scheduled multimes, but p->se.sum_exec_runtime is still 0, because delta_exec maybe 0 in update_curr.
> 
> static void update_curr(struct cfs_rq *cfs_rq)
> {
> ...
> 	delta_exec = now - curr->exec_start;
> 	if (unlikely((s64)delta_exec <= 0))
> 		return;
> ...
> 
> 	curr->sum_exec_runtime += delta_exec;
> ...
> }
> 
> Task has been run and is stopped(on_rq == 0), vruntime not been normalized, but se->sum_exec_runtime == 0.
> This cause vruntime_normalized set on_rq 1, and does not normalize vruntime.
> This may cause task use old vruntime in old cgroup, which maybe very large than task's vruntime in new cgroup.
> Which may cause task may not scheduled in run queue for long time after been waked up.
> 
> Now I change sum_exec_runtime to 1 when sum_exec_runtime == 0 in update_curr to make sun_exec_runtime not 0.

Have you actually observed this? It is very hard to have a 0 delta
between two scheduling events.

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] sched/fair: update_curr changed sum_exec_runtime to 1 when sum_exec_runtime is 0 beacuse some kernel code use sum_exec_runtime==0 to test task just be forked.
  2019-08-27 14:36 ` Peter Zijlstra
@ 2019-08-28  5:02   ` 乔崇
  0 siblings, 0 replies; 3+ messages in thread
From: 乔崇 @ 2019-08-28  5:02 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel

I find this when I run a docker test script.

#-----script begin--

for((i=0; i<1000000; i++))
do
     echo "=================================== test $i start 
================================"
     rm -rf /media/ram/docker/test/*
     docker run -d --name test -h test -v 
/media/docker/app/dataindex:/media/dataindex -v 
/media/ram/docker/test:/run -v /media/docker/app/test:/media/container 
-v /media/wave:/media/wave:ro -v /run/systemd/journal/dev-log:/dev/log 
--ipc=host --log-driver=journald --init arpubt:mips64el-1.2

     docker ps -s
     docker stats --no-stream

     cp /media/arp/testsvc.tar.gz /media/ram/docker/test/
     docker exec -it test app-manager install testsvc
     sleep 3

     docker ps -s
     docker stats --no-stream
     docker exec -it test monit status
     sleep 3

     docker exec -it test app-manager disable testsvc
     sleep 1

     docker exec -it test monit status
     sleep 3

     docker exec -it test app-manager enable testsvc
     sleep 1

     docker exec -it test app-manager disable testsvc
     sleep 1

     docker exec -it test app-manager enable testsvc
     sleep 1

     docker exec -it test app-manager disable testsvc
     sleep 1

     docker exec -it test app-manager enable testsvc
     sleep 1

     docker ps -s
     docker stats --no-stream
     sleep 1

     docker exec -it test app-manager uninstall testsvc
     sleep 1

     docker rm -f test

     sleep 1
     docker ps -s
     docker stats --no-stream

     echo "================================== test $i end 
=================================="
done

#---- script end---

some times, docker exec will stop longtime.

testsvc is a busy loop application.

int i;

int main()

{

while(1) i++;

}

When use kgdb, I see there are two tasks on run queue,  docker exec's 
vruntime is very large, but which is  value is almost same to  linux 
cgroup's min_vruntime. like bellow:

CFS0 0x98000002de50f6e0, enter cgroup 0xffffffff81349900
CFS1 0x98000002dba65700, enter cgroup 0x98000002dbc4a700
SE:0x98000002dab95618 vruntime:49388638792 14348 14344 runc:[2:INIT] 
ra=0xffffffff80e8e110 sp=0x98000002d2b33d40 epc=0x12006b0f0 
usp=0xffeafc6810 task=0x98000002dab955b0 tinfo=0x98000002d2b30000 
uregs=0x98000002d2b33eb0 stat=0x1
CFS2 0x98000002dba57700, enter cgroup 0x98000000014bf100
se:0x98000002daa92a78 vruntime:49391638792 (always run)14266 13758 
testsvc ra=0xffffffff80e8e110 sp=0x98000002d2cbfe30 epc=0x1200017f0 
usp=0xffffc8bc10 task=0x98000002daa92a10 tinfo=0x98000002d2cbc000 
uregs=0x98000002d2cbfeb0 stat=0x0
se:0x98000002dab5b488 vruntime:242669408647 (large) 14349 14344 
runc:[2:INIT] ra=0xffffffff80e8e110 sp=0x98000002d2b37ba0 
epc=0x12006b3d8 usp=0xffea7c6798 task=0x98000002dab5b420 
tinfo=0x98000002d2b34000 uregs=0x98000002d2b37eb0 stat=0x0

I add check code in sched_move_task, and found that  task has been run 
and is stopped(on_rq == 0), vruntime is not changed through 
sched_move_task function, because se->sum_exec_runtime == 0

I add debug code in update_curr:

     if (unlikely((s64)delta_exec <= 0))
     {
      if (!curr->sum_exec_runtime)
      {
         struct task_struct *p;
         p = container_of(curr, struct task_struct, se);
         if ((p->nvcsw + p->nivcsw) != 0 && entity_is_task(curr))
         {
          curr->sum_exec_runtime += 1;
          dump_stack();
         }
      }
dump_stack also sometimes  was called.

Now after I read your letter, I know the reason is:

     unsigned long long __weak sched_clock(void)
       {
               return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                               * (NSEC_PER_SEC / HZ);
      }

My kernel use default sched_clock, which use jiffies as schedule clock.

I should use another high resolution schedule clock, then update_curr's  
dalta_exec will not be 0.

So thanks alot, but I still think curr->sum_exec_runtime check code is 
not accurate enough.


在 2019/8/27 下午10:36, Peter Zijlstra 写道:
> On Mon, Aug 26, 2019 at 07:46:50PM +0800, QiaoChong wrote:
>> From: Chong Qiao <qiaochong@loongson.cn>
>>
>> Such as:
>> cpu_cgroup_attach>
>>   sched_move_task>
>>    task_change_group_fair>
>>     task_move_group_fair>
>>      detach_task_cfs_rq>
>>       vruntime_normalized>
>>
>> 	/*
>> 	 * When !on_rq, vruntime of the task has usually NOT been normalized.
>> 	 * But there are some cases where it has already been normalized:
>> 	 *
>> 	 * - A forked child which is waiting for being woken up by
>> 	 *   wake_up_new_task().
>> 	 * - A task which has been woken up by try_to_wake_up() and
>> 	 *   waiting for actually being woken up by sched_ttwu_pending().
>> 	 */
>> 	if (!se->sum_exec_runtime ||
>> 	    (p->state == TASK_WAKING && p->sched_remote_wakeup))
>> 		return true;
>>
>> p->se.sum_exec_runtime is 0, does not mean task not been run (A forked child which is waiting for being woken up by  wake_up_new_task()).
>>
>> Task may have been scheduled multimes, but p->se.sum_exec_runtime is still 0, because delta_exec maybe 0 in update_curr.
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>> ...
>> 	delta_exec = now - curr->exec_start;
>> 	if (unlikely((s64)delta_exec <= 0))
>> 		return;
>> ...
>>
>> 	curr->sum_exec_runtime += delta_exec;
>> ...
>> }
>>
>> Task has been run and is stopped(on_rq == 0), vruntime not been normalized, but se->sum_exec_runtime == 0.
>> This cause vruntime_normalized set on_rq 1, and does not normalize vruntime.
>> This may cause task use old vruntime in old cgroup, which maybe very large than task's vruntime in new cgroup.
>> Which may cause task may not scheduled in run queue for long time after been waked up.
>>
>> Now I change sum_exec_runtime to 1 when sum_exec_runtime == 0 in update_curr to make sun_exec_runtime not 0.
> Have you actually observed this? It is very hard to have a 0 delta
> between two scheduling events.
>

-- 


乔崇 qiaochong@loongson.cn

2019年 08月 28日 星期三 11:31:43 CST


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2019-08-28  5:02 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-26 11:46 [PATCH] sched/fair: update_curr changed sum_exec_runtime to 1 when sum_exec_runtime is 0 beacuse some kernel code use sum_exec_runtime==0 to test task just be forked QiaoChong
2019-08-27 14:36 ` Peter Zijlstra
2019-08-28  5:02   ` 乔崇

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.