From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754385AbbGAO4G (ORCPT <rfc822;w@1wt.eu>);
	Wed, 1 Jul 2015 10:56:06 -0400
Received: from bastet.se.axis.com ([195.60.68.11]:37105 "EHLO
	bastet.se.axis.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754111AbbGAOzy (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 1 Jul 2015 10:55:54 -0400
Date: Wed, 1 Jul 2015 16:55:51 +0200
From: Rabin Vincent <rabin.vincent@axis.com>
To: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: "mingo@redhat.com" <mingo@redhat.com>,
        "peterz@infradead.org" <peterz@infradead.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
Message-ID: <20150701145551.GA15690@axis.com>
References: <20150630143057.GA31689@axis.com>
 <1435728995.9397.7.camel@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1435728995.9397.7.camel@gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 01, 2015 at 07:36:35AM +0200, Mike Galbraith wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5897,7 +5897,7 @@ static int detach_tasks(struct lb_env *e
>  {
>  	struct list_head *tasks = &env->src_rq->cfs_tasks;
>  	struct task_struct *p;
> -	unsigned long load;
> +	unsigned long load, d_load = 0, s_load = env->src_rq->load.weight;
>  	int detached = 0;
>  
>  	lockdep_assert_held(&env->src_rq->lock);
> @@ -5936,6 +5936,11 @@ static int detach_tasks(struct lb_env *e
>  
>  		detached++;
>  		env->imbalance -= load;
> +		if (!load) {
> +			load = min_t(unsigned long, env->imbalance, p->se.load.weight);
> +			trace_printk("%s:%d is non-contributor - count as %ld\n", p->comm, p->pid, load);
> +		}
> +		d_load += load;
>  
>  #ifdef CONFIG_PREEMPT
>  		/*
> @@ -5954,6 +5959,18 @@ static int detach_tasks(struct lb_env *e
>  		if (env->imbalance <= 0)
>  			break;
>  
> +		/*
> +		 * We don't want to bleed busiest_rq dry either.  Weighted load
> +		 * and/or imbalance may be dinky, load contribution can even be
> +		 * zero, perhaps causing us to over balancem we had not assigned
> +		 * it above.
> +		 */
> +		if (env->src_rq->load.weight <= env->dst_rq->load.weight + d_load) {
> +			trace_printk("OINK - imbal: %ld  load: %ld  run: %d  det: %d  sload_was: %ld sload_is: %ld  dload: %ld\n",
> +				env->imbalance, load, env->src_rq->nr_running, detached, s_load, env->src_rq->load.weight, env->dst_rq->load.weight+d_load);
> +			break;
> +		}
> +
>  		continue;
>  next:
>  		list_move_tail(&p->se.group_node, tasks);
> 

I've tried to analyse how your patch would affect the situation in one
of the crash dumps which I have of the problem.

In this dump, cpu0 is in the middle of dequeuing all tasks from cpu1.
rcu_sched has already been detached and there are two tasks left, one of them
which is being processed by dequeue_entity_load_avg() called from
dequeue_task() at the time the watchdog hits.  lb_env is the following.
imbalance is, as you can see, 60.

 crash> struct lb_env 8054fd50
 struct lb_env {
   sd = 0x8fc13e00, 
   src_rq = 0x81297200, 
   src_cpu = 1, 
   dst_cpu = 0, 
   dst_rq = 0x8128e200, 
   dst_grpmask = 0x0, 
   new_dst_cpu = 0, 
   idle = CPU_NEWLY_IDLE, 
   imbalance = 60, 
   cpus = 0x8128d238, 
   flags = 0, 
   loop = 2, 
   loop_break = 32, 
   loop_max = 3, 
   fbq_type = all, 
   tasks = {
     next = 0x8fc4c6ec, 
     prev = 0x8fc4c6ec
   }
 }

Weights of the runqueues:

 crash> struct rq.load.weight runqueues:0,1
 [0]: 8128e200
   load.weight = 0,
 [1]: 81297200
   load.weight = 1935,

The only running tasks on the system are these three:

 crash> foreach RU ps
    PID    PPID  CPU   TASK    ST  %MEM     VSZ    RSS  COMM
 >     0      0   0  8056d8b0  RU   0.0       0      0  [swapper/0]
 >     0      0   1  8fc5cd18  RU   0.0       0      0  [swapper/1]
 >     0      0   2  8fc5c6b0  RU   0.0       0      0  [swapper/2]
 >     0      0   3  8fc5c048  RU   0.0       0      0  [swapper/3]
       7      2   0  8fc4c690  RU   0.0       0      0  [rcu_sched]
      30      2   1  8fd26108  RU   0.0       0      0  [kswapd0]
     413      1   1  8edda408  RU   0.6    1900    416  rngd

And the load.weight and load_avg_contribs for them and their parent SEs:

 crash> foreach 7 30 413 load
 PID: 7      TASK: 8fc4c690  CPU: 0   COMMAND: "rcu_sched"
  task_h_load():   325 [ = (load_avg_contrib {    5} * cfs_rq->h_load {   65}) / (cfs_rq->runnable_load_avg {    0} + 1) ]
  SE: 8fc4c6d8 load_avg_contrib:     5 load.weight:  1024 PARENT: 00000000 GROUPNAME: (null)
 
 PID: 30     TASK: 8fd26108  CPU: 1   COMMAND: "kswapd0"
  task_h_load():    10 [ = (load_avg_contrib {   10} * cfs_rq->h_load {  133}) / (cfs_rq->runnable_load_avg {  128} + 1) ]
  SE: 8fd26150 load_avg_contrib:    10 load.weight:  1024 PARENT: 00000000 GROUPNAME: (null)
 
 PID: 413    TASK: 8edda408  CPU: 1   COMMAND: "rngd"
  task_h_load():     0 [ = (load_avg_contrib {    0} * cfs_rq->h_load {    0}) / (cfs_rq->runnable_load_avg {    0} + 1) ]
  SE: 8edda450 load_avg_contrib:     0 load.weight:  1024 PARENT: 8fffbd00 GROUPNAME: (null)
  SE: 8fffbd00 load_avg_contrib:     0 load.weight:     2 PARENT: 8f531f80 GROUPNAME: rngd@hwrng.service
  SE: 8f531f80 load_avg_contrib:     0 load.weight:  1024 PARENT: 8f456e00 GROUPNAME: system-rngd.slice
  SE: 8f456e00 load_avg_contrib:   118 load.weight:   911 PARENT: 00000000 GROUPNAME: system.slice

Given the above, we can see that with your patch:

 - dst_rq->load.weight is 0 and will not change in this loop.

 - src_rq->load.weight was 1935 + 1024 before the loop.  It will go down
   to 1935 (already has), 1024, and then 0.
 
 - d_load will be 325*, 335, and then 395.

(* - probably not exactly since rcu_sched has already had set_task_rq() called
[cfs_rq switched] on it, but I guess it's actually going to be much lower based
on the other dumps I see where rcu_sched hasn't be switched yet).

So, we will not hit the "if (env->src_rq->load.weight <=
env->dst_rq->load.weight + d_load)" condition to break out of the loop until we
actualy move all tasks.  So the patch will not have any effect on this case.
Or am I missing something?

We'll set up a test anyway with the patch; the problem usually takes a
couple of days to reproduce.

/Rabin