* [PATCH 0/3] sched: some fixes for vruntime calculation related to cgroup movement
@ 2011-12-13  6:57 ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-13  6:57 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra

Hi,

These patches fix problems I could see in 3.2-rc2 when testing frequent cgroup movement
under very high load. Without these patches, some processes were not scheduled
(although they were queued into the rq) for a very long time (minutes or hours!),
because the vruntime of these processes was far bigger than min_vruntime.
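
For illustration: CFS always runs the runnable entity with the smallest vruntime, so a
task whose vruntime has been pushed far past cfs_rq->min_vruntime is simply not picked
again until every other task has accumulated that much weighted runtime. A self-contained
sketch of the broken arithmetic, taking the fork case of patch 1/3 as the example, with
made-up numbers (old_min/new_min stand for the min_vruntime of the source and destination
cfs_rq; the kernel does this in u64 nanoseconds, and the intermediate wrap-around cancels
out):

    #include <stdio.h>

    int main(void)
    {
            /* hypothetical values, in nanoseconds */
            unsigned long long old_min = 1ULL * 1000000000;     /* source group:      ~1 s */
            unsigned long long new_min = 3600ULL * 1000000000;  /* destination group: ~1 h */
            unsigned long long v = old_min;   /* child starts near the old min_vruntime */

            v -= old_min;   /* task_fork_fair(): normalize against the old cfs_rq        */
            v -= old_min;   /* racy task_move_group_fair(): normalize a second time...   */
            v += new_min;   /* ...and re-base to the new cfs_rq                          */
            v += new_min;   /* enqueue_entity() at wake_up_new_task(): de-normalize      */

            /* the child ends up about (new_min - old_min) ahead of min_vruntime */
            printf("%llu ns ahead of min_vruntime\n", v - new_min);
            return 0;
    }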

Daisuke Nishimura (3):
  sched: fix cgroup movement of newly created process
  sched: fix cgroup movement of forking process
  sched: fix cgroup movement of waking process

 kernel/sched_fair.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-13  6:57   ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-13  6:57 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra

There is a small race between do_fork() and sched_move_task(), which is trying
to move the child.

            do_fork()                 sched_move_task()
--------------------------------+---------------------------------
  copy_process()
    sched_fork()
      task_fork_fair()
        -> vruntime of the child is initialized
           based on that of the parent.
  -> we can see the child in "tasks" file now.
                                    task_rq_lock()
                                    task_move_group_fair()
                                      ->child.se.vruntime -= (old)cfs_rq->min_vruntime
                                      ->child.se.vruntime += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  wake_up_new_task()
    ...
    enqueue_entity()
      child.se->vruntime += cfs_rq->min_vruntime

As a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such a process in task_move_group_fair(),
because its vruntime has already been normalized in task_fork_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 kernel/sched_fair.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..df145a9 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 * to another cgroup's rq. This does somewhat interfere with the
 	 * fair sleeper stuff for the first placement, but who cares.
 	 */
-	if (!on_rq)
+	if (!on_rq && p->state != TASK_RUNNING)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-	if (!on_rq)
+	if (!on_rq && p->state != TASK_RUNNING)
 		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
 }
 #endif
-- 
1.7.1
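
(For context, the "external actor" that triggers this race is anything that writes the
child's PID into another group's tasks file while the fork is still in flight. A rough
userspace reproducer could look like the sketch below; it must itself run inside group
"a", and the cgroup-v1 cpu-controller paths and group names are assumptions to be adapted
to the actual test setup, ideally with plenty of other load on the machine:)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define GRP_A "/sys/fs/cgroup/cpu/a/tasks"  /* assumed to exist */
    #define GRP_B "/sys/fs/cgroup/cpu/b/tasks"  /* assumed to exist */

    /* mover: push every task seen in group a into group b */
    static void mover(void)
    {
            char pid[32];

            for (;;) {
                    FILE *in = fopen(GRP_A, "r");
                    FILE *out = fopen(GRP_B, "w");

                    if (in && out) {
                            while (fgets(pid, sizeof(pid), in)) {
                                    fputs(pid, out); /* may catch a not-yet-woken child */
                                    fflush(out);     /* one pid per write() */
                            }
                    }
                    if (in)
                            fclose(in);
                    if (out)
                            fclose(out);
            }
    }

    /* forker: create short-lived children (they start in group a) as fast as possible */
    static void forker(void)
    {
            for (;;) {
                    pid_t pid = fork();

                    if (pid == 0)
                            _exit(0);
                    if (pid > 0)
                            waitpid(pid, NULL, 0);
            }
    }

    int main(void)
    {
            if (fork() == 0)
                    mover();
            forker();
            return 0;
    }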


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/3] sched: fix cgroup movement of forking process
@ 2011-12-13  6:58   ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-13  6:58 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra

There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.

        task_fork_fair()                 sched_move_task()
--------------------------------+---------------------------------
  cfs_rq = task_cfs_rq(current)
    -> cfs_rq is the "old" one.
  curr = cfs_rq->curr
    -> curr is set to the parent.
                                    task_rq_lock()
                                    dequeue_task()
                                      ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
                                    enqueue_task()
                                      ->parent.se.vruntime += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  raw_spin_lock_irqsave(rq->lock)
  se->vruntime = curr->vruntime
    -> vruntime of the child is set to that of the parent
       which has already been updated by sched_move_task().
  se->vruntime -= (old)cfs_rq->min_vruntime.
  raw_spin_unlock_irqrestore(rq->lock)

As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by setting "cfs_rq" and "curr" after taking
the rq->lock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 kernel/sched_fair.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index df145a9..bdaa4ab 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4787,14 +4787,17 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(current);
-	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se = &p->se, *curr;
 	int this_cpu = smp_processor_id();
 	struct rq *rq = this_rq();
 	unsigned long flags;
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
+	cfs_rq = task_cfs_rq(current);
+	curr = cfs_rq->curr;
+
 	update_rq_clock(rq);
 
 	if (unlikely(task_cpu(p) != this_cpu)) {
-- 
1.7.1
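
(Spelling out the arithmetic of this interleaving, as an illustration: let v_p be the
parent's absolute vruntime before the move, old_min/new_min the two groups' min_vruntime,
and assume the child itself stays on the old cfs_rq:

    expected child vruntime:        ~ v_p                    (inherit the parent's position)
    parent after sched_move_task():   v_p - old_min + new_min  (re-based to the new group)
    child in task_fork_fair():        copies the re-based value, then subtracts the
                                      stale old_min:  v_p - 2*old_min + new_min
    child after enqueue_entity():     v_p - old_min + new_min  =  expected + (new_min - old_min)

so the child starts roughly (new_min - old_min) of vruntime behind schedule.)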


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/3] sched: fix cgroup movement of waking process
@ 2011-12-13  6:59   ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-13  6:59 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra

There is a small race between try_to_wake_up() and sched_move_task(), which is trying
to move the process being woken up.

    try_to_wake_up() on CPU0       sched_move_task() on CPU1
--------------------------------+---------------------------------
  raw_spin_lock_irqsave(p->pi_lock)
  task_waking_fair()
    ->p.se.vruntime -= cfs_rq->min_vruntime
  ttwu_queue()
    ->send reschedule IPI to another CPU1
  raw_spin_unlock_irqsave(p->pi_lock)
                                   task_rq_lock()
                                     -> trying to acquire both p->pi_lock and rq->lock
                                        with IRQs disabled
                                   task_move_group_fair()
                                     ->p.se.vruntime -= (old)cfs_rq->min_vruntime
                                     ->p.se.vruntime += (new)cfs_rq->min_vruntime
                                   task_rq_unlock()

                                   (via IPI)
                                   sched_ttwu_pending()
                                     raw_spin_lock(rq->lock)
                                     ttwu_do_activate()
                                       ...
                                       enqueue_entity()
                                         child.se->vruntime += cfs_rq->min_vruntime
                                     raw_spin_unlock(rq->lock)

As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such a process in task_move_group_fair(),
because its vruntime has already been normalized in task_waking_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 kernel/sched_fair.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bdaa4ab..3feb3a2 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -4925,10 +4925,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 * to another cgroup's rq. This does somewhat interfere with the
 	 * fair sleeper stuff for the first placement, but who cares.
 	 */
-	if (!on_rq && p->state != TASK_RUNNING)
+	if (!on_rq && p->state != TASK_RUNNING && p->state != TASK_WAKING)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-	if (!on_rq && p->state != TASK_RUNNING)
+	if (!on_rq && p->state != TASK_RUNNING && p->state != TASK_WAKING)
 		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
 }
 #endif
-- 
1.7.1
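
(The same accounting for the wakeup case, as an illustration, with v the task's absolute
vruntime when it went to sleep and old_min/new_min the two groups' min_vruntime:

    after task_waking_fair():        v - old_min               (normalized once)
    after task_move_group_fair():    v - 2*old_min + new_min   (normalized again, re-based)
    after enqueue_entity():          v - 2*old_min + 2*new_min

which leaves the task about (new_min - old_min) past the new cfs_rq's min_vruntime.)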


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-13 12:01     ` Paul Turner
  0 siblings, 0 replies; 35+ messages in thread
From: Paul Turner @ 2011-12-13 12:01 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: LKML, cgroups, Ingo Molnar, Peter Zijlstra

On 12/12/2011 10:57 PM, Daisuke Nishimura wrote:

> There is a small race between do_fork() and sched_move_task(), which is trying
> to move the child.
>
>             do_fork()                 sched_move_task()
> --------------------------------+---------------------------------
>   copy_process()
>     sched_fork()
>       task_fork_fair()
>         -> vruntime of the child is initialized
>            based on that of the parent.


Hmm, so here vruntime of child is actually initialized to vruntime - min_V

>   -> we can see the child in "tasks" file now.
>                                     task_rq_lock()
>                                     task_move_group_fair()


So since on a regular fork we just copy and don't actually go through
the attach muck I'm assuming this is an external actor who's seen the
child in the tasks file and is moving it?

>                                       ->child.se.vruntime -= (old)cfs_rq->min_vruntime
>                                       ->child.se.vruntime += (new)cfs_rq->min_vruntime


This would then add delta min_V between the two cfs_rqs

>                                     task_rq_unlock()
>   wake_up_new_task()
>     ...
>     enqueue_entity()
>       child.se->vruntime += cfs_rq->min_vruntime

>
> As a result, vruntime of the child becomes far bigger than min_vruntime,
> if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
>
> This patch fixes this problem by just ignoring such process in task_move_group_fair(),
> because the vruntime has already been normalized in task_fork_fair().
>
> Signed-off-by: Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
> ---
>  kernel/sched_fair.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5c9e679..df145a9 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
>  	 * to another cgroup's rq. This does somewhat interfere with the
>  	 * fair sleeper stuff for the first placement, but who cares.
>  	 */
> -	if (!on_rq)
> +	if (!on_rq && p->state != TASK_RUNNING)
>  		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;


We also have the choice of keying off something like
!se.sum_exec_runtime here, which is reset in sched_fork() and might
be less fragile/make the problem interaction more obvious too.  Either
way this is really a corner case deserving of an explanatory comment.
This is a little icky but I don't have any wildly better ideas.
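
(A sketch of what keying off sum_exec_runtime would look like here; as noted above it is
reset in sched_fork(), so a zero value identifies a task that has never run. A variant of
this test is what v2 of the patch, later in the thread, ends up using:

	if (!on_rq && p->se.sum_exec_runtime)
		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
	set_task_rq(p, task_cpu(p));
	if (!on_rq && p->se.sum_exec_runtime)
		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
)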

>  	set_task_rq(p, task_cpu(p));
> -	if (!on_rq)
> +	if (!on_rq && p->state != TASK_RUNNING)
>  		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
>  }
>  #endif


Acked-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/3] sched: fix cgroup movement of forking process
@ 2011-12-13 12:22     ` Paul Turner
  0 siblings, 0 replies; 35+ messages in thread
From: Paul Turner @ 2011-12-13 12:22 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: LKML, cgroups, Ingo Molnar, Peter Zijlstra

On Mon, Dec 12, 2011 at 10:58 PM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> There is a small race between task_fork_fair() and sched_move_task(),
> which is trying to move the parent.
>
>        task_fork_fair()                 sched_move_task()
> --------------------------------+---------------------------------
>  cfs_rq = task_cfs_rq(current)
>    -> cfs_rq is the "old" one.
>  curr = cfs_rq->curr
>    -> curr is set to the parent.
>                                    task_rq_lock()
>                                    dequeue_task()
>                                      ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
>                                    enqueue_task()
>                                      ->parent.se.vruntime += (new)cfs_rq->min_vruntime
>                                    task_rq_unlock()
>  raw_spin_lock_irqsave(rq->lock)
>  se->vruntime = curr->vruntime
>    -> vruntime of the child is set to that of the parent
>       which has already been updated by sched_move_task().
>  se->vruntime -= (old)cfs_rq->min_vruntime.
>  raw_spin_unlock_irqrestore(rq->lock)
>
> As a result, vruntime of the child becomes far bigger than expected,
> if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
>
> This patch fixes this problem by setting "cfs_rq" and "curr" after holding
> the rq->lock.
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  kernel/sched_fair.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index df145a9..bdaa4ab 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -4787,14 +4787,17 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  */
>  static void task_fork_fair(struct task_struct *p)
>  {
> -       struct cfs_rq *cfs_rq = task_cfs_rq(current);
> -       struct sched_entity *se = &p->se, *curr = cfs_rq->curr;

Strictly speaking we're calling
current->sched_class->task_fork_fair() so we know current is in
sched_fair, which means it has to be cfs_rq->curr.

Because of that this could become *curr = &current->se and then
cfs_rq_of(curr) below.

But the current healthy paranoia is ok too.
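
(Roughly, the alternative Paul describes would be:

	struct sched_entity *se = &p->se, *curr = &current->se;
	...
	raw_spin_lock_irqsave(&rq->lock, flags);

	cfs_rq = cfs_rq_of(curr);

with cfs_rq_of(curr) still resolved after rq->lock is taken, so the min_vruntime used
below belongs to the group current actually runs in.)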

> +       struct cfs_rq *cfs_rq;
> +       struct sched_entity *se = &p->se, *curr;
>        int this_cpu = smp_processor_id();
>        struct rq *rq = this_rq();
>        unsigned long flags;
>
>        raw_spin_lock_irqsave(&rq->lock, flags);
>
> +       cfs_rq = task_cfs_rq(current);
> +       curr = cfs_rq->curr;
> +
>        update_rq_clock(rq);
>
>        if (unlikely(task_cpu(p) != this_cpu)) {
> --
> 1.7.1
>

Acked-by: Paul Turner <pjt@google.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/3] sched: fix cgroup movement of waking process
@ 2011-12-13 12:24     ` Paul Turner
  0 siblings, 0 replies; 35+ messages in thread
From: Paul Turner @ 2011-12-13 12:24 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: LKML, cgroups, Ingo Molnar, Peter Zijlstra

On Mon, Dec 12, 2011 at 10:59 PM, Daisuke Nishimura
<nishimura@mxp.nes.nec.co.jp> wrote:
> There is a small race between try_to_wake_up() and sched_move_task(), which is trying
> to move the process being woken up.
>
>    try_to_wake_up() on CPU0       sched_move_task() on CPU1
> --------------------------------+---------------------------------
>  raw_spin_lock_irqsave(p->pi_lock)
>  task_waking_fair()
>    ->p.se.vruntime -= cfs_rq->min_vruntime
>  ttwu_queue()
>    ->send reschedule IPI to another CPU1
>  raw_spin_unlock_irqsave(p->pi_lock)
>                                   task_rq_lock()
>                                     -> trying to acquire both p->pi_lock and rq->lock
>                                        with IRQs disabled
>                                   task_move_group_fair()
>                                     ->p.se.vruntime -= (old)cfs_rq->min_vruntime
>                                     ->p.se.vruntime += (new)cfs_rq->min_vruntime
>                                   task_rq_unlock()
>
>                                   (via IPI)
>                                   sched_ttwu_pending()
>                                     raw_spin_lock(rq->lock)
>                                     ttwu_do_activate()
>                                       ...
>                                       enqueue_entity()
>                                         child.se->vruntime += cfs_rq->min_vruntime
>                                     raw_spin_unlock(rq->lock)
>
> As a result, vruntime of the process becomes far bigger than min_vruntime,
> if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
>
> This patch fixes this problem by just ignoring such process in task_move_group_fair(),
> because the vruntime has already been normalized in task_waking_fair().
>
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  kernel/sched_fair.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index bdaa4ab..3feb3a2 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -4925,10 +4925,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
>         * to another cgroup's rq. This does somewhat interfere with the
>         * fair sleeper stuff for the first placement, but who cares.
>         */
> -       if (!on_rq && p->state != TASK_RUNNING)
> +       if (!on_rq && p->state != TASK_RUNNING && p->state != TASK_WAKING)

!p->se.sum_exec_runtime is starting to look more attractive here...

>                p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
>        set_task_rq(p, task_cpu(p));
> -       if (!on_rq && p->state != TASK_RUNNING)
> +       if (!on_rq && p->state != TASK_RUNNING && p->state != TASK_WAKING)
>                p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
>  }
>  #endif
> --
> 1.7.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-13 12:41     ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2011-12-13 12:41 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: LKML, cgroups, Ingo Molnar

On Tue, 2011-12-13 at 15:57 +0900, Daisuke Nishimura wrote:

>  kernel/sched_fair.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)

you blink you lose, that file doesn't exist anymore.


> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5c9e679..df145a9 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
>  	 * to another cgroup's rq. This does somewhat interfere with the
>  	 * fair sleeper stuff for the first placement, but who cares.
>  	 */
> -	if (!on_rq)
> +	if (!on_rq && p->state != TASK_RUNNING)
>  		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
>  	set_task_rq(p, task_cpu(p));
> -	if (!on_rq)
> +	if (!on_rq && p->state != TASK_RUNNING)
>  		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
>  }
>  #endif

The much saner way of writing that is something like:

 /*
  * Comment explaining stuff..
  */
 if (!on_rq && p->state == TASK_RUNNING)
	on_rq = 1;

 ...



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-14  1:04       ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-14  1:04 UTC (permalink / raw)
  To: Paul Turner; +Cc: LKML, cgroups, Ingo Molnar, Peter Zijlstra, Daisuke Nishimura

On Tue, 13 Dec 2011 04:01:21 -0800
Paul Turner <pjt@google.com> wrote:

> On 12/12/2011 10:57 PM, Daisuke Nishimura wrote:
> 
> > There is a small race between do_fork() and sched_move_task(), which is trying
> > to move the child.
> >
> >             do_fork()                 sched_move_task()
> > --------------------------------+---------------------------------
> >   copy_process()
> >     sched_fork()
> >       task_fork_fair()
> >         -> vruntime of the child is initialized
> >            based on that of the parent.
> 
> 
> Hmm, so here vruntime of child is actually initialized to vruntime - min_V
> 
> >   -> we can see the child in "tasks" file now.
> >                                     task_rq_lock()
> >                                     task_move_group_fair()
> 
> 
> So since on a regular fork we just copy and don't actually go through
> the attach muck I'm assuming this is an external actor who's seen the
> child in the tasks file and is moving it?
> 
Right.

> >                                       ->child.se.vruntime -= (old)cfs_rq->min_vruntime
> >                                       ->child.se.vruntime += (new)cfs_rq->min_vruntime
> 
> 
> This would then add delta min_V between the two cfs_rqs
> 
> >                                     task_rq_unlock()
> >   wake_up_new_task()
> >     ...
> >     enqueue_entity()
> >       child.se->vruntime += cfs_rq->min_vruntime
> 
> >
> > As a result, vruntime of the child becomes far bigger than min_vruntime,
> > if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
> >
> > This patch fixes this problem by just ignoring such process in task_move_group_fair(),
> > because the vruntime has already been normalized in task_fork_fair().
> >
> > Signed-off-by: Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
> > ---This would need an explanatory
> >  kernel/sched_fair.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 5c9e679..df145a9 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
> >  	 * to another cgroup's rq. This does somewhat interfere with the
> >  	 * fair sleeper stuff for the first placement, but who cares.
> >  	 */
> > -	if (!on_rq)
> > +	if (!on_rq && p->state != TASK_RUNNING)
> >  		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
> 
> 
> We also have the choice of keying off something like
> !se.sum_exec_runtime here which is reset in sched_fork() which might
> be less fragile/make the problem interaction more obvious to.  Either

sum_exec_runtime can be used for a process that has just been created,
but IIUC it cannot be used for a process that is being woken up from
sleeping (the [3/3] case).
Or, do you mean using p->se.sum_exec_runtime for [1/3] and p->state for [3/3]?
If so, I'll do it in the next post.
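
(Combining the two, i.e. sum_exec_runtime for the fork case and p->state for the wakeup
case, would presumably read something like the following, in the style Peter suggested:

	if (!on_rq && (!p->se.sum_exec_runtime || p->state == TASK_WAKING))
		on_rq = 1;
)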

> way this is really a corner case deserving of an explanatory comment.
> This is a little icky but I don't have any wildly better ideas.
> 
I'll add comments.

> >  	set_task_rq(p, task_cpu(p));
> > -	if (!on_rq)
> > +	if (!on_rq && p->state != TASK_RUNNING)
> >  		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
> >  }
> >  #endif
> 
> 
> Acked-by: Paul Turner <pjt@google.com>

Thank you for your reviews.

Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-14  1:05       ` Daisuke Nishimura
  -1 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-14  1:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, cgroups, Ingo Molnar, Daisuke Nishimura

On Tue, 13 Dec 2011 13:41:09 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2011-12-13 at 15:57 +0900, Daisuke Nishimura wrote:
> 
> >  kernel/sched_fair.c |    4 ++--
> >  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> you blink you loose, that file doesn't exist anymore.
> 
hmm, indeed.
I'll rebase these patches onto the tip.

> 
> > diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> > index 5c9e679..df145a9 100644
> > --- a/kernel/sched_fair.c
> > +++ b/kernel/sched_fair.c
> > @@ -4922,10 +4922,10 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
> >  	 * to another cgroup's rq. This does somewhat interfere with the
> >  	 * fair sleeper stuff for the first placement, but who cares.
> >  	 */
> > -	if (!on_rq)
> > +	if (!on_rq && p->state != TASK_RUNNING)
> >  		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
> >  	set_task_rq(p, task_cpu(p));
> > -	if (!on_rq)
> > +	if (!on_rq && p->state != TASK_RUNNING)
> >  		p->se.vruntime += cfs_rq_of(&p->se)->min_vruntime;
> >  }
> >  #endif
> 
> The much saner way of writing that is something like:
> 
>  /*
>   * Comment explaining stuff..
>   */
>  if (!on_rq && p->state == TASK_RUNNING)
> 	on_rq = 1;
> 
>  ...
> 
will do in the next post.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH -tip 0/3] sched: some fixes for vruntime calculation related to cgroup movement (v2)
@ 2011-12-15  5:35 ` Daisuke Nishimura
  -1 siblings, 4 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-15  5:35 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Daisuke Nishimura

Hi, all.

This is an updated version based on comments and rebased onto the -tip tree.

These patches fix problems I could see in 3.2-rc2 when testing frequent cgroup movement
under very high load. Without these patches, some processes were not scheduled
(although they were queued into the rq) for a very long time (minutes or hours!),
because the vruntime of these processes was far bigger than min_vruntime.

Daisuke Nishimura (3):
  sched: fix cgroup movement of newly created process
  sched: fix cgroup movement of forking process
  sched: fix cgroup movement of waking process

 kernel/sched/fair.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/3] sched: fix cgroup movement of newly created process
@ 2011-12-15  5:36     ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-15  5:36 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Daisuke Nishimura

There is a small race between do_fork() and sched_move_task(), which is
trying to move the child.

            do_fork()                 sched_move_task()
--------------------------------+---------------------------------
  copy_process()
    sched_fork()
      task_fork_fair()
        -> vruntime of the child is initialized
           based on that of the parent.
  -> we can see the child in "tasks" file now.
                                    task_rq_lock()
                                    task_move_group_fair()
                                      -> child.se.vruntime
                                           -= (old)cfs_rq->min_vruntime
                                           += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  wake_up_new_task()
    ...
    enqueue_entity()
      child.se.vruntime += cfs_rq->min_vruntime

As a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such a process in
task_move_group_fair(), because its vruntime has already been normalized in
task_fork_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 kernel/sched/fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a4d2b7a..43cadd0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5336,6 +5336,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 * to another cgroup's rq. This does somewhat interfere with the
 	 * fair sleeper stuff for the first placement, but who cares.
 	 */
+	/*
+	 * When !on_rq, vruntime of the task has usually NOT been normalized.
+	 * But there are some cases where it has already been normalized:
+	 *
+	 * - Moving a forked child which is waiting to be woken up by
+	 *   wake_up_new_task().
+	 *
+	 * To prevent boost or penalty in the new cfs_rq caused by delta
+	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
+	 */
+	if (!on_rq && !p->se.sum_exec_runtime)
+		on_rq = 1;
+
 	if (!on_rq)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/3] sched: fix cgroup movement of forking process
@ 2011-12-15  5:36     ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-15  5:36 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Daisuke Nishimura

There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.

        task_fork_fair()                 sched_move_task()
--------------------------------+---------------------------------
  cfs_rq = task_cfs_rq(current)
    -> cfs_rq is the "old" one.
  curr = cfs_rq->curr
    -> curr is set to the parent.
                                    task_rq_lock()
                                    dequeue_task()
                                      ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
                                    enqueue_task()
                                      ->parent.se.vruntime += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  raw_spin_lock_irqsave(rq->lock)
  se->vruntime = curr->vruntime
    -> vruntime of the child is set to that of the parent
       which has already been updated by sched_move_task().
  se->vruntime -= (old)cfs_rq->min_vruntime.
  raw_spin_unlock_irqrestore(rq->lock)

As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by setting "cfs_rq" and "curr" after
holding the rq->lock.
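
The fix boils down to a standard rule: any state that a concurrent
sched_move_task() can change must be sampled only while rq->lock is held.
A loose user-space analogy of the before/after (all names below are invented;
this is not the kernel's locking):

#include <pthread.h>
#include <stdio.h>

struct toy_cfs_rq { unsigned long long min_vruntime; };
struct toy_task   { struct toy_cfs_rq *cfs_rq; unsigned long long vruntime; };

static pthread_mutex_t toy_rq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shape of the old code: the group is sampled before the lock is taken, so a
 * concurrent "move" can change parent->cfs_rq underneath us and we end up
 * mixing old-group and new-group values. */
unsigned long long child_vruntime_racy(struct toy_task *parent)
{
	struct toy_cfs_rq *cfs_rq = parent->cfs_rq;	/* possibly stale */
	unsigned long long v;

	pthread_mutex_lock(&toy_rq_lock);
	v = parent->vruntime - cfs_rq->min_vruntime;
	pthread_mutex_unlock(&toy_rq_lock);
	return v;
}

/* Shape of the fixed code: everything the mover can change is read only while
 * the lock is held. */
unsigned long long child_vruntime_fixed(struct toy_task *parent)
{
	unsigned long long v;

	pthread_mutex_lock(&toy_rq_lock);
	v = parent->vruntime - parent->cfs_rq->min_vruntime;
	pthread_mutex_unlock(&toy_rq_lock);
	return v;
}

int main(void)
{
	struct toy_cfs_rq grp = { .min_vruntime = 1000ULL };
	struct toy_task parent = { .cfs_rq = &grp, .vruntime = 1500ULL };

	printf("%llu %llu\n", child_vruntime_racy(&parent),
	       child_vruntime_fixed(&parent));
	return 0;
}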

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43cadd0..b1db01b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5191,8 +5191,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(current);
-	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se = &p->se, *curr;
 	int this_cpu = smp_processor_id();
 	struct rq *rq = this_rq();
 	unsigned long flags;
@@ -5201,6 +5201,9 @@ static void task_fork_fair(struct task_struct *p)
 
 	update_rq_clock(rq);
 
+	cfs_rq = task_cfs_rq(current);
+	curr = cfs_rq->curr;
+
 	if (unlikely(task_cpu(p) != this_cpu)) {
 		rcu_read_lock();
 		__set_task_cpu(p, this_cpu);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/3] sched: fix cgroup movement of waking process
  2011-12-15  5:35 ` [PATCH -tip 0/3] sched: some fixes for vruntime calculation related to cgroup movement(v2) Daisuke Nishimura
  2011-12-15  5:36     ` Daisuke Nishimura
  2011-12-15  5:36     ` Daisuke Nishimura
@ 2011-12-15  5:37   ` Daisuke Nishimura
  2011-12-21 11:45     ` [tip:sched/core] sched: Fix " tip-bot for Daisuke Nishimura
  2011-12-19  2:55     ` Daisuke Nishimura
  3 siblings, 1 reply; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-15  5:37 UTC (permalink / raw)
  To: LKML, cgroups; +Cc: Ingo Molnar, Peter Zijlstra, Paul Turner, Daisuke Nishimura

There is a small race between try_to_wake_up() and sched_move_task(),
which is trying to move the process being woken up.

    try_to_wake_up() on CPU0       sched_move_task() on CPU1
--------------------------------+---------------------------------
  raw_spin_lock_irqsave(p->pi_lock)
  task_waking_fair()
    ->p.se.vruntime -= cfs_rq->min_vruntime
  ttwu_queue()
    ->send reschedule IPI to CPU1
  raw_spin_unlock_irqsave(p->pi_lock)
                                   task_rq_lock()
                                     -> trying to acquire both p->pi_lock and
                                        rq->lock with IRQ disabled
                                   task_move_group_fair()
                                     -> p.se.vruntime
                                          -= (old)cfs_rq->min_vruntime
                                          += (new)cfs_rq->min_vruntime
                                   task_rq_unlock()

                                   (via IPI)
                                   sched_ttwu_pending()
                                     raw_spin_lock(rq->lock)
                                     ttwu_do_activate()
                                       ...
                                       enqueue_entity()
                                         p.se.vruntime += cfs_rq->min_vruntime
                                     raw_spin_unlock(rq->lock)

As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_waking_fair().
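
In other words, the patch extends the "vruntime is already normalized" test.
A minimal stand-alone sketch of the resulting decision (simplified types and
field names; not the kernel code) looks like this:

#include <stdbool.h>
#include <stdio.h>

enum toy_state { TOY_RUNNING, TOY_WAKING, TOY_BLOCKED };

struct toy_task {
	bool on_rq;
	unsigned long long sum_exec_runtime;
	enum toy_state state;
};

/* A !on_rq task keeps a relative (already normalized) vruntime if it has
 * never run (fresh fork waiting for wake_up_new_task()) or if it sits between
 * try_to_wake_up() and sched_ttwu_pending() (TASK_WAKING).  Only the
 * remaining !on_rq tasks need the old min_vruntime subtracted before
 * switching cfs_rqs. */
bool needs_vruntime_renormalization(const struct toy_task *p)
{
	if (p->on_rq)
		return false;
	return p->sum_exec_runtime && p->state != TOY_WAKING;
}

int main(void)
{
	struct toy_task fresh_fork = { .on_rq = false, .sum_exec_runtime = 0,   .state = TOY_BLOCKED };
	struct toy_task waking     = { .on_rq = false, .sum_exec_runtime = 123, .state = TOY_WAKING };
	struct toy_task sleeper    = { .on_rq = false, .sum_exec_runtime = 123, .state = TOY_BLOCKED };

	printf("%d %d %d\n",
	       needs_vruntime_renormalization(&fresh_fork),	/* 0 */
	       needs_vruntime_renormalization(&waking),		/* 0 */
	       needs_vruntime_renormalization(&sleeper));	/* 1 */
	return 0;
}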

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 kernel/sched/fair.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1db01b..afd13a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5345,11 +5345,13 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 *
 	 * - Moving a forked child which is waiting for being woken up by
 	 *   wake_up_new_task().
+	 * - Moving a task which has been woken up by try_to_wake_up() and
+	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 *
 	 * To prevent boost or penalty in the new cfs_rq caused by delta
 	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
 	 */
-	if (!on_rq && !p->se.sum_exec_runtime)
+	if (!on_rq && (!p->se.sum_exec_runtime || p->state == TASK_WAKING))
 		on_rq = 1;
 
 	if (!on_rq)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH -tip 0/3] sched: some fixes for vruntime calculation related to cgroup movement(v2)
@ 2011-12-19  2:55     ` Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: Daisuke Nishimura @ 2011-12-19  2:55 UTC (permalink / raw)
  To: Ingo Molnar, LKML, cgroups; +Cc: Peter Zijlstra, Paul Turner, Daisuke Nishimura

Does anyone have comments?
If there are any problems that prevent these patches from being accepted,
I'll fix them.

Thanks,
Daisuke Nishimura

On Thu, 15 Dec 2011 14:35:06 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> Hi, all.
> 
> This is an updated version based on comments and rebased onto the -tip tree.
> 
> These patches fix problems I could see in 3.2-rc2 when testing frequent cgroup movement
> under very high load. Without these patches, some processes were not scheduled
> (although they were queued on the rq) for a very long time (minutes or hours!),
> because the vruntime of these processes was far bigger than min_vruntime.
> 
> Daisuke Nishimura (3):
>   sched: fix cgroup movement of newly created process
>   sched: fix cgroup movement of forking process
>   sched: fix cgroup movement of waking process
> 
>  kernel/sched/fair.c |   22 ++++++++++++++++++++--
>  1 files changed, 20 insertions(+), 2 deletions(-)
> 
> Thanks,
> Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tip:sched/core] sched: Fix cgroup movement of forking process
  2011-12-15  5:36     ` Daisuke Nishimura
  (?)
@ 2011-12-21 11:44     ` tip-bot for Daisuke Nishimura
  2011-12-21 17:26       ` Tejun Heo
  -1 siblings, 1 reply; 35+ messages in thread
From: tip-bot for Daisuke Nishimura @ 2011-12-21 11:44 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, pjt, nishimura, tj, tglx, mingo

Commit-ID:  4fc420c91f53e0a9f95665c6b14a1983716081e7
Gitweb:     http://git.kernel.org/tip/4fc420c91f53e0a9f95665c6b14a1983716081e7
Author:     Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
AuthorDate: Thu, 15 Dec 2011 14:36:55 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 21 Dec 2011 10:34:49 +0100

sched: Fix cgroup movement of forking process

There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.

        task_fork_fair()                 sched_move_task()
--------------------------------+---------------------------------
  cfs_rq = task_cfs_rq(current)
    -> cfs_rq is the "old" one.
  curr = cfs_rq->curr
    -> curr is set to the parent.
                                    task_rq_lock()
                                    dequeue_task()
                                      ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
                                    enqueue_task()
                                      ->parent.se.vruntime += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  raw_spin_lock_irqsave(rq->lock)
  se->vruntime = curr->vruntime
    -> vruntime of the child is set to that of the parent
       which has already been updated by sched_move_task().
  se->vruntime -= (old)cfs_rq->min_vruntime.
  raw_spin_unlock_irqrestore(rq->lock)

As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by setting "cfs_rq" and "curr" after
holding the rq->lock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cea2fa8..525d69e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5190,8 +5190,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
  */
 static void task_fork_fair(struct task_struct *p)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(current);
-	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;
+	struct cfs_rq *cfs_rq;
+	struct sched_entity *se = &p->se, *curr;
 	int this_cpu = smp_processor_id();
 	struct rq *rq = this_rq();
 	unsigned long flags;
@@ -5200,6 +5200,9 @@ static void task_fork_fair(struct task_struct *p)
 
 	update_rq_clock(rq);
 
+	cfs_rq = task_cfs_rq(current);
+	curr = cfs_rq->curr;
+
 	if (unlikely(task_cpu(p) != this_cpu)) {
 		rcu_read_lock();
 		__set_task_cpu(p, this_cpu);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip:sched/core] sched: Fix cgroup movement of newly created process
  2011-12-15  5:36     ` Daisuke Nishimura
  (?)
@ 2011-12-21 11:45     ` tip-bot for Daisuke Nishimura
  -1 siblings, 0 replies; 35+ messages in thread
From: tip-bot for Daisuke Nishimura @ 2011-12-21 11:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, nishimura, tj, tglx, mingo

Commit-ID:  7ceff013c43c0f38f0d26c79507889c6791c0ea0
Gitweb:     http://git.kernel.org/tip/7ceff013c43c0f38f0d26c79507889c6791c0ea0
Author:     Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
AuthorDate: Thu, 15 Dec 2011 14:36:07 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 21 Dec 2011 10:34:51 +0100

sched: Fix cgroup movement of newly created process

There is a small race between do_fork() and sched_move_task(), which is
trying to move the child.

            do_fork()                 sched_move_task()
--------------------------------+---------------------------------
  copy_process()
    sched_fork()
      task_fork_fair()
        -> vruntime of the child is initialized
           based on that of the parent.
  -> we can see the child in "tasks" file now.
                                    task_rq_lock()
                                    task_move_group_fair()
                                      -> child.se.vruntime
                                           -= (old)cfs_rq->min_vruntime
                                           += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  wake_up_new_task()
    ...
    enqueue_entity()
      child.se.vruntime += cfs_rq->min_vruntime

As a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_fork_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 525d69e..2d1ac6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5338,6 +5338,19 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 * to another cgroup's rq. This does somewhat interfere with the
 	 * fair sleeper stuff for the first placement, but who cares.
 	 */
+	/*
+	 * When !on_rq, vruntime of the task has usually NOT been normalized.
+	 * But there are some cases where it has already been normalized:
+	 *
+	 * - Moving a forked child which is waiting for being woken up by
+	 *   wake_up_new_task().
+	 *
+	 * To prevent boost or penalty in the new cfs_rq caused by delta
+	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
+	 */
+	if (!on_rq && !p->se.sum_exec_runtime)
+		on_rq = 1;
+
 	if (!on_rq)
 		p->se.vruntime -= cfs_rq_of(&p->se)->min_vruntime;
 	set_task_rq(p, task_cpu(p));

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [tip:sched/core] sched: Fix cgroup movement of waking process
  2011-12-15  5:37   ` [PATCH 3/3] sched: fix cgroup movement of waking process Daisuke Nishimura
@ 2011-12-21 11:45     ` tip-bot for Daisuke Nishimura
  0 siblings, 0 replies; 35+ messages in thread
From: tip-bot for Daisuke Nishimura @ 2011-12-21 11:45 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, nishimura, tj, tglx, mingo

Commit-ID:  62af3783e4fd8ba9e28416e8e91cb3bdd9fb133e
Gitweb:     http://git.kernel.org/tip/62af3783e4fd8ba9e28416e8e91cb3bdd9fb133e
Author:     Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
AuthorDate: Thu, 15 Dec 2011 14:37:41 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 21 Dec 2011 10:34:52 +0100

sched: Fix cgroup movement of waking process

There is a small race between try_to_wake_up() and sched_move_task(),
which is trying to move the process being woken up.

    try_to_wake_up() on CPU0       sched_move_task() on CPU1
--------------------------------+---------------------------------
  raw_spin_lock_irqsave(p->pi_lock)
  task_waking_fair()
    ->p.se.vruntime -= cfs_rq->min_vruntime
  ttwu_queue()
    ->send reschedule IPI to CPU1
  raw_spin_unlock_irqsave(p->pi_lock)
                                   task_rq_lock()
                                     -> trying to acquire both p->pi_lock and
                                        rq->lock with IRQ disabled
                                   task_move_group_fair()
                                     -> p.se.vruntime
                                          -= (old)cfs_rq->min_vruntime
                                          += (new)cfs_rq->min_vruntime
                                   task_rq_unlock()

                                   (via IPI)
                                   sched_ttwu_pending()
                                     raw_spin_lock(rq->lock)
                                     ttwu_do_activate()
                                       ...
                                       enqueue_entity()
                                         p.se.vruntime += cfs_rq->min_vruntime
                                     raw_spin_unlock(rq->lock)

As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_waking_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched/fair.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d1ac6e..bdf1883 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5344,11 +5344,13 @@ static void task_move_group_fair(struct task_struct *p, int on_rq)
 	 *
 	 * - Moving a forked child which is waiting for being woken up by
 	 *   wake_up_new_task().
+	 * - Moving a task which has been woken up by try_to_wake_up() and
+	 *   waiting for actually being woken up by sched_ttwu_pending().
 	 *
 	 * To prevent boost or penalty in the new cfs_rq caused by delta
 	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
 	 */
-	if (!on_rq && !p->se.sum_exec_runtime)
+	if (!on_rq && (!p->se.sum_exec_runtime || p->state == TASK_WAKING))
 		on_rq = 1;
 
 	if (!on_rq)

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [tip:sched/core] sched: Fix cgroup movement of forking process
  2011-12-21 11:44     ` [tip:sched/core] sched: Fix " tip-bot for Daisuke Nishimura
@ 2011-12-21 17:26       ` Tejun Heo
  2011-12-21 17:37         ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2011-12-21 17:26 UTC (permalink / raw)
  To: tip-bot for Daisuke Nishimura
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, a.p.zijlstra, pjt,
	tglx, mingo

Hello, guys.

On Wed, Dec 21, 2011 at 03:44:14AM -0800, tip-bot for Daisuke Nishimura wrote:
> sched: Fix cgroup movement of forking process
> 
> There is a small race between task_fork_fair() and sched_move_task(),
> which is trying to move the parent.
> 
>         task_fork_fair()                 sched_move_task()
> --------------------------------+---------------------------------
>   cfs_rq = task_cfs_rq(current)
>     -> cfs_rq is the "old" one.
>   curr = cfs_rq->curr
>     -> curr is set to the parent.
>                                     task_rq_lock()
>                                     dequeue_task()
>                                       ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
>                                     enqueue_task()
>                                       ->parent.se.vruntime += (new)cfs_rq->min_vruntime
>                                     task_rq_unlock()
>   raw_spin_lock_irqsave(rq->lock)
>   se->vruntime = curr->vruntime
>     -> vruntime of the child is set to that of the parent
>        which has already been updated by sched_move_task().
>   se->vruntime -= (old)cfs_rq->min_vruntime.
>   raw_spin_unlock_irqrestore(rq->lock)
> 
> As a result, vruntime of the child becomes far bigger than expected,
> if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
> 
> This patch fixes this problem by setting "cfs_rq" and "curr" after
> holding the rq->lock.

The race shouldn't happen with threadgroup locking scheduled to be
merged for the coming merge window.  sched_fork() and cgroup migration
become exclusive and won't happen concurrently.  Would still make
sense for -stable tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tip:sched/core] sched: Fix cgroup movement of forking process
  2011-12-21 17:26       ` Tejun Heo
@ 2011-12-21 17:37         ` Tejun Heo
  2011-12-22  1:54           ` Frederic Weisbecker
  0 siblings, 1 reply; 35+ messages in thread
From: Tejun Heo @ 2011-12-21 17:37 UTC (permalink / raw)
  To: tip-bot for Daisuke Nishimura
  Cc: linux-tip-commits, linux-kernel, hpa, mingo, a.p.zijlstra, pjt,
	tglx, mingo, Frederic Weisbecker

(cc'ing Frederic)

On Wed, Dec 21, 2011 at 09:26:32AM -0800, Tejun Heo wrote:
> Hello, guys.
> 
> On Wed, Dec 21, 2011 at 03:44:14AM -0800, tip-bot for Daisuke Nishimura wrote:
> > sched: Fix cgroup movement of forking process
> > 
> > There is a small race between task_fork_fair() and sched_move_task(),
> > which is trying to move the parent.
> > 
> >         task_fork_fair()                 sched_move_task()
> > --------------------------------+---------------------------------
> >   cfs_rq = task_cfs_rq(current)
> >     -> cfs_rq is the "old" one.
> >   curr = cfs_rq->curr
> >     -> curr is set to the parent.
> >                                     task_rq_lock()
> >                                     dequeue_task()
> >                                       ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
> >                                     enqueue_task()
> >                                       ->parent.se.vruntime += (new)cfs_rq->min_vruntime
> >                                     task_rq_unlock()
> >   raw_spin_lock_irqsave(rq->lock)
> >   se->vruntime = curr->vruntime
> >     -> vruntime of the child is set to that of the parent
> >        which has already been updated by sched_move_task().
> >   se->vruntime -= (old)cfs_rq->min_vruntime.
> >   raw_spin_unlock_irqrestore(rq->lock)
> > 
> > As a result, vruntime of the child becomes far bigger than expected,
> > if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
> > 
> > This patch fixes this problem by setting "cfs_rq" and "curr" after
> > holding the rq->lock.
> 
> The race shouldn't happen with threadgroup locking scheduled to be
> merged for the coming merge window.  sched_fork() and cgroup migration
> become exclusive and won't happen concurrently.  Would still make
> sense for -stable tho.

I retract that.  sched_move_task() can also be called from
cgroup_exit() which is outside of threadgroup locking.

Frederic, so, it seems we actually have race conditions here.  I
really wish cgroup made sure that things like this can't happen even
if we pay a bit of overhead in relatively cold paths.  I could be
being unrealistic tho.  Any ideas?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tip:sched/core] sched: Fix cgroup movement of forking process
  2011-12-21 17:37         ` Tejun Heo
@ 2011-12-22  1:54           ` Frederic Weisbecker
  2011-12-22  2:01             ` Tejun Heo
  0 siblings, 1 reply; 35+ messages in thread
From: Frederic Weisbecker @ 2011-12-22  1:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: tip-bot for Daisuke Nishimura, linux-tip-commits, linux-kernel,
	hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

On Wed, Dec 21, 2011 at 09:37:33AM -0800, Tejun Heo wrote:
> (cc'ing Frederic)
> 
> On Wed, Dec 21, 2011 at 09:26:32AM -0800, Tejun Heo wrote:
> > Hello, guys.
> > 
> > On Wed, Dec 21, 2011 at 03:44:14AM -0800, tip-bot for Daisuke Nishimura wrote:
> > > sched: Fix cgroup movement of forking process
> > > 
> > > There is a small race between task_fork_fair() and sched_move_task(),
> > > which is trying to move the parent.
> > > 
> > >         task_fork_fair()                 sched_move_task()
> > > --------------------------------+---------------------------------
> > >   cfs_rq = task_cfs_rq(current)
> > >     -> cfs_rq is the "old" one.
> > >   curr = cfs_rq->curr
> > >     -> curr is set to the parent.
> > >                                     task_rq_lock()
> > >                                     dequeue_task()
> > >                                       ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
> > >                                     enqueue_task()
> > >                                       ->parent.se.vruntime += (new)cfs_rq->min_vruntime
> > >                                     task_rq_unlock()
> > >   raw_spin_lock_irqsave(rq->lock)
> > >   se->vruntime = curr->vruntime
> > >     -> vruntime of the child is set to that of the parent
> > >        which has already been updated by sched_move_task().
> > >   se->vruntime -= (old)cfs_rq->min_vruntime.
> > >   raw_spin_unlock_irqrestore(rq->lock)
> > > 
> > > As a result, vruntime of the child becomes far bigger than expected,
> > > if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.
> > > 
> > > This patch fixes this problem by setting "cfs_rq" and "curr" after
> > > holding the rq->lock.
> > 
> > The race shouldn't happen with threadgroup locking scheduled to be
> > merged for the coming merge window.  sched_fork() and cgroup migration
> > become exclusive and won't happen concurrently.  Would still make
> > sense for -stable tho.
> 
> I retract that.  sched_move_task() can also be called from
> cgroup_exit() which is outside of threadgroup locking.
> 
> Frederic, so, it seems we actually have race conditions here.  I
> really wish cgroup made sure that things like this can't happen even
> if we pay a bit of overhead in relatively cold paths.  I could be
> being unrealistic tho.  Any ideas?

Hmm, I'm a bit confused about the issue. But doesn't this patch fix it?

Also the parent can't be calling sched_fork() and cgroup_exit() at
the same time.

Or am I missing something?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tip:sched/core] sched: Fix cgroup movement of forking process
  2011-12-22  1:54           ` Frederic Weisbecker
@ 2011-12-22  2:01             ` Tejun Heo
  0 siblings, 0 replies; 35+ messages in thread
From: Tejun Heo @ 2011-12-22  2:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: tip-bot for Daisuke Nishimura, linux-tip-commits, linux-kernel,
	hpa, mingo, a.p.zijlstra, pjt, tglx, mingo

Hello,

On Thu, Dec 22, 2011 at 02:54:42AM +0100, Frederic Weisbecker wrote:
> > Frederic, so, it seems we actually have race conditions here.  I
> > really wish cgroup made sure that things like this can't happen even
> > if we pay a bit of overhead in relatively cold paths.  I could be
> > being unrealistic tho.  Any ideas?
> 
> Hmm, I'm a bit confused about the issue. But doesn't this patch fix it?

Yeah, I was just saying it would be nice if we don't have to fix the
issue.

> Also the parent can't be calling sched_fork() and cgroup_exit() at
> the same time.

Hmmm... so this one can't happen anymore, but there are races involving
cgroup_exit() which don't have anything to do with the threadgroup lock
(Daisuke has more patches), so this doesn't really matter.  Let's
forget about this thread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2011-12-22  2:02 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-13  6:57 [PATCH 0/3] sched: some fixes for vruntime calculation related to cgroup movement Daisuke Nishimura
2011-12-13  6:57 ` Daisuke Nishimura
2011-12-13  6:57 ` [PATCH 1/3] sched: fix cgroup movement of newly created process Daisuke Nishimura
2011-12-13  6:57   ` Daisuke Nishimura
2011-12-13 12:01   ` Paul Turner
2011-12-13 12:01     ` Paul Turner
2011-12-14  1:04     ` Daisuke Nishimura
2011-12-14  1:04       ` Daisuke Nishimura
2011-12-13 12:41   ` Peter Zijlstra
2011-12-13 12:41     ` Peter Zijlstra
2011-12-14  1:05     ` Daisuke Nishimura
2011-12-14  1:05       ` Daisuke Nishimura
2011-12-13  6:58 ` [PATCH 2/3] sched: fix cgroup movement of forking process Daisuke Nishimura
2011-12-13  6:58   ` Daisuke Nishimura
2011-12-13 12:22   ` Paul Turner
2011-12-13 12:22     ` Paul Turner
2011-12-13  6:59 ` [PATCH 3/3] sched: fix cgroup movement of waking process Daisuke Nishimura
2011-12-13  6:59   ` Daisuke Nishimura
2011-12-13 12:24   ` Paul Turner
2011-12-13 12:24     ` Paul Turner
2011-12-15  5:35 ` [PATCH -tip 0/3] sched: some fixes for vruntime calculation related to cgroup movement(v2) Daisuke Nishimura
2011-12-15  5:36   ` [PATCH 1/3] sched: fix cgroup movement of newly created process Daisuke Nishimura
2011-12-15  5:36     ` Daisuke Nishimura
2011-12-21 11:45     ` [tip:sched/core] sched: Fix " tip-bot for Daisuke Nishimura
2011-12-15  5:36   ` [PATCH 2/3] sched: fix cgroup movement of forking process Daisuke Nishimura
2011-12-15  5:36     ` Daisuke Nishimura
2011-12-21 11:44     ` [tip:sched/core] sched: Fix " tip-bot for Daisuke Nishimura
2011-12-21 17:26       ` Tejun Heo
2011-12-21 17:37         ` Tejun Heo
2011-12-22  1:54           ` Frederic Weisbecker
2011-12-22  2:01             ` Tejun Heo
2011-12-15  5:37   ` [PATCH 3/3] sched: fix cgroup movement of waking process Daisuke Nishimura
2011-12-21 11:45     ` [tip:sched/core] sched: Fix " tip-bot for Daisuke Nishimura
2011-12-19  2:55   ` [PATCH -tip 0/3] sched: some fixes for vruntime calculation related to cgroup movement(v2) Daisuke Nishimura
2011-12-19  2:55     ` Daisuke Nishimura
