* [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
@ 2021-04-25  8:09 Odin Ugedal
  2021-04-25  8:09 ` [PATCH 1/1] " Odin Ugedal
  2021-04-26 14:58 ` [PATCH 0/1] " Vincent Guittot
  0 siblings, 2 replies; 15+ messages in thread
From: Odin Ugedal @ 2021-04-25  8:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: cgroups, linux-kernel, Odin Ugedal

TL;DR: Read the patch message first; hopefully it makes sense. Feel free
to cc more relevant people, and (as I am pretty new to this) feedback
is highly appreciated!

I thought it was best to split the discussion of the patch and the issue, so
here is some (wall of text, sorry) discussion around it:

This patch fixes an extremely strange edge case, discovered more or less by
accident, where the fairness of the cfs scheduler gets skewed quite
badly. It often leads to two cgroups that are supposed to share
cpu time 50/50 instead getting something like 90/10, or
even 99/1 (!!) in some rare cases (see below for examples). I find this
edge case quite interesting, especially since it is so easy to
reproduce on all machines and all (4.8+) kernels.

The issue can be easily reproduced on all kernels 4.8+, and afaik most
"container runtimes" are affected. I believe I have wondered about
this issue before, without actually understanding what was causing it. There
is really (afaik) no way to find the issue from userspace without using
a tracing tool like bpftrace. I am _sure_ this affects real production
environments _everywhere_, at least those running the usual container/cgroup
tools together with cpu pinning; but verifying that the
fairness is skewed is quite hard, or more or less impossible, in such cases,
especially when running multiple workloads simultaneously. I did inspect some
of my production servers and found several instances of this on multiple cgroups.

It _is_ possible to read '/proc/sched_debug', but since that list only prints
the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for
are missing, making it look like the problem isn't caused by cfs.

I think the current patch is the best way to solve the issue, but I will happily
discuss other possible solutions. One could also simply not propagate the
load to the task_group, but I think the current implementation of that part
is the correct way to do it. This is at least my best attempt at avoiding extra
logic in a "hotter" path.

The only thing I am still thinking a bit about is what happens if the cfs_rq (or one of its
ancestors) is throttled. In case cfs_rq->throttled==1, I don't think we can currently skip it,
since the cfs_rq is only added back to the leaf_list when (cfs_rq->nr_running >= 1), and the
same applies for (cfs_rq->throttle_count >= 1); we cannot guarantee that it will be added
again. Will this code interfere with the throttling mechanism, though? I have tested inserting
new procs into throttled cgroups (or children of throttled ones) with the patch applied,
and it works as I expect it to (but other people might have more insight).
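
For reference, the kind of sanity check I ran looks roughly like this (cgroup v1; the
paths, quota and timings here are only a sketch, not the exact test):

--- bash start
# attach a new task to a child of a cgroup that is already being throttled,
# and verify that bandwidth control still kicks in as expected
CG=/sys/fs/cgroup/cpu/slice/throttled
mkdir -p "$CG"/child
tee "$CG"/cpu.cfs_quota_us <<< 10000         # 10ms quota per default 100ms period
sh -c 'sleep .5; exec stress --cpu 1' &      # first hog; exhausts the quota
tee "$CG"/cgroup.procs <<< "$!"
sleep 2                                      # wait until the group is being throttled
sh -c 'sleep .5; exec stress --cpu 1' &      # new proc, attached to a child of the throttled group
tee "$CG"/child/cgroup.procs <<< "$!"
sleep 2
grep throttled "$CG"/cpu.stat                # nr_throttled/throttled_time should keep growing
killall stress
--- bash end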

Also, another solution may be to not add the load before the task is actually
enqueued, avoiding this issue altogether. But is that better (I assume it would
be more in the "hot path", and require quite a bit more code)?

There may definitely be a better solution to this that I still haven't found, and that
other more experienced kernel devs will see right away. :)


Also, feel free to suggest additions to or rewording of the patch
message and title, and/or the comment in the code, to make it clearer.

This issue was introduced in 4.8, so all stable (4.9+) releases should probably
get this (the final solution, at least), shouldn't they?



Below are various ways/scripts to reproduce - sorry this is so long (and
sorry for bash in general), but I thought people might be interested in them:

note: due to the nature of the issue, the "lost" load is different each time, so
the values change between runs, and sometimes/often end up at ~50/50; but my testing
shows that it keeps happening almost every time:


Example on cgroup v1. Often results in a 60/40 load split:
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       18026 63.0  0.0   3676   100 pts/7    R+   13:09   0:06 stress --cpu 1
root       18036 36.6  0.0   3676   100 pts/7    R+   13:09   0:04 stress --cpu 1
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local CMD="$3"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .1
  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"
  tee "$CG_CPU"/cgroup.procs <<< "$TASK"

  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0" 
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0" 

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # click enter to cleanup
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end

Example on cgroup v2 with a sub cgroup (same as described in the commit message),
where both should get 50/50, but instead get 99% and 1% (!!).

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1

(in case of systemd on cgroup v2, make sure some slice/scope is delegated and/or use
cpusets, otherwise systemd will fight you)
--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CPU=0

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local CMD="$3"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .01
  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"
  tee "$CG_CPU"/cgroup.procs <<< "$TASK"

  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0" 
tee "$CGROUP_CPUSET"/cpuset.mems <<< "0" 

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"

read # click enter to cleanup
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
--- bash end



For those who only want to run docker stuff:
(on cgroup v1, you must use the systemd cgroup driver)
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        9291 60.1  0.0   7320    96 pts/0    R+   13:18   0:07 /usr/bin/stress --verbose --cpu 1
root        9388 45.5  0.0   7320    96 pts/0    R+   13:18   0:04 /usr/bin/stress --verbose --cpu 1
--- bash start
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
docker run --cpuset-cpus=0 --rm -it an-image-with-stress
--- bash end


Here is a bpftrace script that can show what is happening (again, sorry for the bash,
but bpftrace only allows constants as array indexes as of now, so this is the best I can do).
The script also needs bpftrace v0.13 (not released as of now, so currently you have
to compile master) or newer.


--- bash start
PROBE='kfunc:sched_group_set_shares{
  printf("cgroup: %s/%s\n",
    str(args->tg->css.cgroup->kn->parent->name),
    str(args->tg->css.cgroup->kn->name));
  printf(
"cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg\n"
);
}'
for i in $(seq 0 $(($(nproc)-1))); do
    PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{
    printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n",
    cpu_nr,
    (args->tg->cfs_rq[cpu_nr])->load.weight,
    (args->tg->cfs_rq[cpu_nr])->avg.load_avg,
    (args->tg->cfs_rq[cpu_nr])->removed.load_avg,
    (args->tg->cfs_rq[cpu_nr])->removed.nr,
    (args->tg->cfs_rq[cpu_nr])->on_list,
    (args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib,
    args->tg->load_avg.counter
    );
}')"
done
PROBE="$PROBE"'kfunc:sched_group_set_shares{
  printf("\n");
}'

bpftrace -e "$PROBE"
--- bash end


When running the bpftrace script while the sub cgroup example is running, and
executing the following (writing cpu.weight ends up in sched_group_set_shares, so
re-setting the weights to the same values as before triggers the probes without
actually changing anything):

--- bash start
tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1
tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000
tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100
tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100
--- bash end

the output is:

--- output start
Attaching 6 probes...
cgroup: cg-1/sub
cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
0    1048576      1023          0                 0           1                1034                 1662
1    0            0             0                 0           0                0                    1662
2    0            0             0                 0           0                0                    1662
3    0            628           628               1           0                628                  1662

cgroup: cg-2/sub
cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
0    1048576      1023          0                 0           1                1023                 1830
1    0            0             0                 0           0                0                    1830
2    0            0             0                 0           0                0                    1830
3    0            807           807               1           0                807                  1830

cgroup: slice/cg-1
cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
0    6347         5             0                 0           1                5                    593
1    0            0             0                 0           0                0                    593
2    0            0             0                 0           0                0                    593
3    0            5             0                 0           0                588                  593

cgroup: slice/cg-2
cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
0    58642371     57263         0                 0           1                57909                58665
1    0            0             0                 0           0                0                    58665
2    0            0             0                 0           0                0                    58665
3    0            75615         0                 0           0                756                  58665

--- output end

We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3,
and therefore still contribute to tg->load_avg. Since removed.on_list==0, that load would
never be "cleaned up" unless a new task starts on that cpu, and due to the cpuset, that will
not happen. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but
the patch will properly decay the load on them as well).
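
To put numbers on it: for cg-1/sub the two listed contributions are 1034 + 628 = 1662,
which matches tg->load_avg exactly, so the stale 628 on cpu 3 accounts for well over a
third of the group's total load even though nothing can run there anymore.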

With this patch, all of these examples end up sharing cpu time in a fair 50/50 way, as expected.

Odin Ugedal (1):
  sched/fair: Fix unfairness caused by missing load decay

 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-25  8:09 [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay Odin Ugedal
@ 2021-04-25  8:09 ` Odin Ugedal
  2021-04-27 14:26   ` Vincent Guittot
  2021-04-26 14:58 ` [PATCH 0/1] " Vincent Guittot
  1 sibling, 1 reply; 15+ messages in thread
From: Odin Ugedal @ 2021-04-25  8:09 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: cgroups, linux-kernel, Odin Ugedal

This fixes an issue where old load on a cfs_rq is not properly decayed,
resulting in strange behavior where fairness can decrease drastically.
Real workloads with equally weighted control groups have ended up
getting a respective 99% and 1% (!!) of cpu time.

When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
the old load of the task is attached to the new cfs_rq and sched_entity by
attach_entity_cfs_rq. If the task is then moved to another cpu (and
therefore cfs_rq) before being enqueued/woken up, the load will be moved
to cfs_rq->removed from the sched_entity. Such a move will happen when
enforcing a cpuset on the task (eg. via a cgroup) that forces it to move.

The load will however not be removed from the task_group itself, making
it look like there is a constant load on that cfs_rq. This causes the
vruntime of tasks on other sibling cfs_rq's to increase faster than they
are supposed to, causing severe fairness issues. If no other task is
started on the given cfs_rq (and due to the cpuset, none will be),
this load would never be properly removed. With this patch the load
will be properly removed inside update_blocked_averages. This also
applies to tasks moved to the fair scheduling class and then moved to
another cpu, and the patch will fix that case as well. For fork, the
entity is queued right away, so this problem does not affect that.

For a simple cgroup hierarchy (as seen below) with two equally weighted
groups, which in theory should get 50% of cpu time each, this often leads
to a split of 60/40 or 70/30.

parent/
  cg-1/
    cpu.weight: 100
    cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    cpuset.cpus: 1

If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
equally weighted, they should still get a 50/50 balance of cpu time.
This however sometimes results in a balance of 10/90 or 1/99(!!) between
the task groups.

$ ps u -C stress
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1

parent/
  cg-1/
    cpu.weight: 100
    sub-group/
      cpu.weight: 1
      cpuset.cpus: 1
  cg-2/
    cpu.weight: 100
    sub-group/
      cpu.weight: 10000
      cpuset.cpus: 1

This can be reproduced by attaching an idle process to a cgroup and
moving it to a given cpuset before it wakes up. The issue is evident in
many (if not most) container runtimes, and has been reproduced
with both crun and runc (and therefore docker and all its "derivatives"),
and with both cgroup v1 and v2.

Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")
Signed-off-by: Odin Ugedal <odin@uged.al>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..ad7556f99b4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10916,6 +10916,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
 
 	if (!vruntime_normalized(p))
 		se->vruntime += cfs_rq->min_vruntime;
+
+	/*
+	 * Make sure the attached load will decay properly
+	 * in case the task is moved to another cpu before
+	 * being queued.
+	 */
+	if (!task_on_rq_queued(p)) {
+		for_each_sched_entity(se) {
+			if (se->on_rq)
+				break;
+			list_add_leaf_cfs_rq(cfs_rq_of(se));
+		}
+	}
 }
 
 static void switched_from_fair(struct rq *rq, struct task_struct *p)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-25  8:09 [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay Odin Ugedal
  2021-04-25  8:09 ` [PATCH 1/1] " Odin Ugedal
@ 2021-04-26 14:58 ` Vincent Guittot
  2021-04-26 16:33   ` Odin Ugedal
  2021-04-27  8:36   ` Odin Ugedal
  1 sibling, 2 replies; 15+ messages in thread
From: Vincent Guittot @ 2021-04-26 14:58 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel

On Sun, 25 Apr 2021 at 10:11, Odin Ugedal <odin@uged.al> wrote:
>
> TL;DR: Read the patch msg first; hopefully it makes sense. Feel free
> to cc more relevant people, and (as I am pretty new to this) feedback
> is highly appreciated!
>
> I thought it was smart to split the discussion of the patch and the issuse, so
> here is some (wall of text, sorry) discussion around it:
>
> This patch fixes an extremely strange edge case, discovered more or less by
> accident, where the fairness of the cfs scheduler get skewed quite
> extensively. This often leads to two cgroups who are supposed to get
> share cpu time 50/50, instead getting something like 90/10 or
> even 99/1 (!!) in some rare cases (see below for example). I find this
> edge case quite interesting, especially since I am able to so easily
> reproduce it on all machines and all (4.8+) kernels.
>
> The issue can be easily reproduced on all kernels 4.8+, and afaik. most
> "container runtimes" are affected. I do believe that I have wondered about
> this issue earlier, but not actually understood what was causing it. There
> is really (afaik.) no way to find the issue from userspace without using
> a tracing tool like bpftrace. I am _sure_ this affects real production
> environments _everywhere_, at least those running the normal container/using
> cgroup tools together with cpu pinning; but verifying that the
> fairness is skewed is quite hard, or more or less impossible in such cases,
> especially when running multiple workloads simultaneously. I did inspect some
> of my production servers, and found a set of this on multiple cgroups.
>
> It _is_ possible to read '/proc/sched_debug', but since that list only prints
> the cfs_rq's from "leaf_cfs_rq_list", the cfs_rq's we are _really_ looking for,
> are missing; making it look like the problem isn't becuase of cfs.
>
> I think the current patch is the best way solve the issue, but I will happily
> discuss other possible solutions. One could also just "don't" propagate the
> load to the task_group, but I think the current implementation on that part
> is the correct way to do it. This is at least my best solution to avoid adding
> logic to a "hotter" path that also would require more logic.
>
> The only thing I am thinking a bit about is if the cfs_rq (or one of its ancestors) is
> throttled. In case the cfs_rq->throttled==1, I don't think we currently can skip, since
> it only adds to the leaf_list in case (cfs_rq->nr_running >= 1), and the same applies for
> (cfs_rq->throttle_count >= 1), since we cannot guarantee that it will be added again. Will
> this code however interfere with the throttling mechanism? I have tested inserting new procs
> to throttled cgroups (or children cgroups of throttled ones) with the patch applied,
> and it works as I expect it to do (but other people might have more insight).
>
> Also, another solution may be to don't add the load before the task is actually
> enqueued, avoiding this issue all together. But is that better (I assume that will
> be more in the "hot path", and require quite a bit more code)?
>
> There may definetly be a better solution to this that I still haven't found, and that
> other more experienced kernel devs see right away. :)
>
>
> Also feel free to suggest additions/wording of the patch
> message and title, and/or the comment in the code to make it more clear.
>
> This issue was introduced in 4.8, so all stable (4.9+) releases should probably
> get this (the final solution at least), or?
>

Have you been able to reproduce this on mainline ?

>
>
> Below is various ways/scripts to reproduce - sorry this is so long (and
> sorry for bash in general), but thought people might be interested in them:
>
> note: due to the nature of the issue, the "lost" load is different each time, so
> the values change each time, and sometimes/often end up at ~50/50; but my testing
> shows that it keep happening almost every time:
>
>
> Example on cgruoup v1. Often results in 60/40 load:
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       18026 63.0  0.0   3676   100 pts/7    R+   13:09   0:06 stress --cpu 1
> root       18036 36.6  0.0   3676   100 pts/7    R+   13:09   0:04 stress --cpu 1

When running the script below on v5.12, I'm not able to reproduce your problem

> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
>   local CG_CPUSET="$1"
>   local CG_CPU="$2"
>   local CMD="$3"
>
>   local PIPE="$(mktemp -u)"
>   mkfifo "$PIPE"
>   sh -c "read < $PIPE ; exec $CMD" &
>   local TASK="$!"
>   sleep .1
>   mkdir -p "$CG_CPUSET"
>   mkdir -p "$CG_CPU"
>   tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
>   tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
>   tee "$PIPE" <<< sandox_done
>   rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
> Example on cgroup v2 with sub cgroup (same as described in commit message),
> where both should get 50/50, but instead getting 99% and 1% (!!).
>
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
> root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1
>
> (in case of systemd on vg v2- make sure some slice/scope is delegated(?)/use
> cpusets, otherwise systemd will fight you)
> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CPU=0
>
> function run_sandbox {
>   local CG_CPUSET="$1"
>   local CG_CPU="$2"
>   local CMD="$3"
>
>   local PIPE="$(mktemp -u)"
>   mkfifo "$PIPE"
>   sh -c "read < $PIPE ; exec $CMD" &
>   local TASK="$!"
>   sleep .01
>   mkdir -p "$CG_CPUSET"
>   mkdir -p "$CG_CPU"
>   tee "$CG_CPU"/cgroup.procs <<< "$TASK"
>
>   tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
>   tee "$PIPE" <<< sandox_done
>   rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "0"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "0"
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" "stress --cpu 1"
>
> read # click enter to cleanup
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-1,cg-2,}
> --- bash end
>
>
>
> For those who only want to run docker stuff:
> (on cgroup v1, you must use systemd driver)
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root        9291 60.1  0.0   7320    96 pts/0    R+   13:18   0:07 /usr/bin/stress --verbose --cpu 1
> root        9388 45.5  0.0   7320    96 pts/0    R+   13:18   0:04 /usr/bin/stress --verbose --cpu 1
> --- bash start
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> docker run --cpuset-cpus=0 --rm -it an-image-with-stress
> --- bash end
>
>
> Here is a bpftrace script that can show what is happening (again. sorry for bash,
> but bpftrace only allow constants as array indexes as of now, so this is the best I can do).
> This script also needs bpftrace v0.13 (not released as of now, so currently you have
> to compile master) or newer.
>
>
> --- bash start
> PROBE='kfunc:sched_group_set_shares{
>   printf("cgroup: %s/%s\n",
>     str(args->tg->css.cgroup->kn->parent->name),
>     str(args->tg->css.cgroup->kn->name));
>   printf(
> "cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg\n"
> );
> }'
> for i in $(seq 0 $(($(nproc)-1))); do
>     PROBE="$PROBE""$(sed "s/cpu_nr/$i/" <<<' kfunc:sched_group_set_shares{
>     printf("%-4d %-12llu %-13llu %-17llu %-11d %-16d %-20llu %d\n",
>     cpu_nr,
>     (args->tg->cfs_rq[cpu_nr])->load.weight,
>     (args->tg->cfs_rq[cpu_nr])->avg.load_avg,
>     (args->tg->cfs_rq[cpu_nr])->removed.load_avg,
>     (args->tg->cfs_rq[cpu_nr])->removed.nr,
>     (args->tg->cfs_rq[cpu_nr])->on_list,
>     (args->tg->cfs_rq[cpu_nr])->tg_load_avg_contrib,
>     args->tg->load_avg.counter
>     );
> }')"
> done
> PROBE="$PROBE"'kfunc:sched_group_set_shares{
>   printf("\n");
> }'
>
> bpftrace -e "$PROBE"
> --- bash end
>
>
> When running the bpftrace script when the sub cgroup example is running, and
> executing (just setting the weight of the cgroup the same value as before, no change):
>
> --- bash start
> tee /sys/fs/cgroup/slice/cg-1/sub/cpu.weight <<< 1
> tee /sys/fs/cgroup/slice/cg-2/sub/cpu.weight <<< 10000
> tee /sys/fs/cgroup/slice/cg-1/cpu.weight <<< 100
> tee /sys/fs/cgroup/slice/cg-2/cpu.weight <<< 100
> --- bash end
>
> the output is:
>
> --- output start
> Attaching 6 probes...
> cgroup: cg-1/sub
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    1048576      1023          0                 0           1                1034                 1662
> 1    0            0             0                 0           0                0                    1662
> 2    0            0             0                 0           0                0                    1662
> 3    0            628           628               1           0                628                  1662
>
> cgroup: cg-2/sub
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    1048576      1023          0                 0           1                1023                 1830
> 1    0            0             0                 0           0                0                    1830
> 2    0            0             0                 0           0                0                    1830
> 3    0            807           807               1           0                807                  1830
>
> cgroup: slice/cg-1
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    6347         5             0                 0           1                5                    593
> 1    0            0             0                 0           0                0                    593
> 2    0            0             0                 0           0                0                    593
> 3    0            5             0                 0           0                588                  593
>
> cgroup: slice/cg-2
> cpu  load.weight  avg.load_avg  removed.load_avg  removed.nr  removed.on_list  tg_load_avg_contrib  tg->load_avg
> 0    58642371     57263         0                 0           1                57909                58665
> 1    0            0             0                 0           0                0                    58665
> 2    0            0             0                 0           0                0                    58665
> 3    0            75615         0                 0           0                756                  58665
>
> --- output end
>
> We can clearly see that both cg-1/sub and cg-2/sub have removed.nr==1 on cpu 3,
> and therefore still contribute to the tg->load_avg. Since removed.on_list==0, the load would
> never be "cleaned" up unless a new task starts on that cpu, but due to the cpuset, that would not
> be the case. slice/cg-1 and slice/cg-2 also have load attached to cpu 3 that isn't removed (but
> the patch will properly decay the load on them as well).
>
> With this path, all of these examples just end up sharing cpu time in a fair 50/50 way, as expected.
>
> Odin Ugedal (1):
>   sched/fair: Fix unfairness caused by missing load decay
>
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-26 14:58 ` [PATCH 0/1] " Vincent Guittot
@ 2021-04-26 16:33   ` Odin Ugedal
  2021-04-27 10:55     ` Vincent Guittot
  2021-04-27  8:36   ` Odin Ugedal
  1 sibling, 1 reply; 15+ messages in thread
From: Odin Ugedal @ 2021-04-26 16:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	linux-kernel

Hi,


> Have you been able to reproduce this on mainline ?

Yes. I have been debugging and testing with v5.12-rc8. After I found the
suspected commit in ~v4.8, I compiled both v4.4.267 and v4.9.267, and was
able to successfully reproduce it on v4.9.267 but not on v4.4.267. It is
also reproducible on 5.11.16-arch1-1 that my distro ships, and on all the
machines I have tested.

> When running the script below on v5.12, I'm not able to reproduce your problem

v5.12 is pretty fresh, so I had not tested on anything newer than v5.12-rc8. I did
compile v5.12.0 now, and I am able to reproduce it there as well.

Which version did you try (the one for cgroup v1 or v2)? And/or did you try
to run the inspection bpftrace script? If you tested the cgroup v1 version,
it will often end up at 50/50, 51/49 etc., and sometimes 60/40+-, making it
hard to verify without inspection.

I have attached a version of the "sub cgroup" example for cgroup v1
that also forces the process to start on cpu 1 (CPU_ME) and sends it
over to cpu 0 (CPU) after attaching it to the new cgroup. That makes
the issue evident every time. This example should also always end up
with 50/50 per stress process, but "always" ends up more like 99/1.

Can you confirm whether you are able to reproduce it with this version?

--- bash start
CGROUP_CPU=/sys/fs/cgroup/cpu/slice
CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
CPU=0
CPU_ME=1

function run_sandbox {
  local CG_CPUSET="$1"
  local CG_CPU="$2"
  local INNER_SHARES="$3"
  local CMD="$4"

  local PIPE="$(mktemp -u)"
  mkfifo "$PIPE"
  sh -c "read < $PIPE ; exec $CMD" &
  local TASK="$!"
  sleep .1
  mkdir -p "$CG_CPUSET"
  mkdir -p "$CG_CPU"/sub
  tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
  tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"

  tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"

  tee "$PIPE" <<< sandox_done
  rm "$PIPE"
}

mkdir -p "$CGROUP_CPU"
mkdir -p "$CGROUP_CPUSET"
mkdir -p "$CGROUP_CPUSET_ME"

tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"

tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs

run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2     "stress --cpu 1"

read # click enter to cleanup and stop all stress procs
killall stress
sleep .2
rmdir /sys/fs/cgroup/cpuset/slice/
rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
--- bash end


Thanks
Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-26 14:58 ` [PATCH 0/1] " Vincent Guittot
  2021-04-26 16:33   ` Odin Ugedal
@ 2021-04-27  8:36   ` Odin Ugedal
  1 sibling, 0 replies; 15+ messages in thread
From: Odin Ugedal @ 2021-04-27  8:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	linux-kernel

Also, instead of bpftrace, one can look at the /proc/sched_debug file
and infer the state from there.

Something like:

$ cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

gives me the output (when one stress proc gets 99%, and the other one 1%):

cfs_rq[0]:/slice/cg-2/sub
  .load_avg                      : 1023
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1035
  .tg_load_avg                   : 1870
  .se->avg.load_avg              : 56391
cfs_rq[0]:/slice/cg-1/sub
  .load_avg                      : 1023
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1024
  .tg_load_avg                   : 1847
  .se->avg.load_avg              : 4
cfs_rq[0]:/slice/cg-1
  .load_avg                      : 4
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 4
  .tg_load_avg                   : 794
  .se->avg.load_avg              : 5
cfs_rq[0]:/slice/cg-2
  .load_avg                      : 56401
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 56700
  .tg_load_avg                   : 57496
  .se->avg.load_avg              : 1008
cfs_rq[0]:/slice
  .load_avg                      : 1015
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1009
  .tg_load_avg                   : 2314
  .se->avg.load_avg              : 447


As can be seen here, no other cfs_rq for the relevant cgroups is
"active" and listed, but they still contribute to eg. the "tg_load_avg". In this
example, "cfs_rq[0]:/slice/cg-1" has a load_avg of 4, and contributes 4 to
tg_load_avg. However, the total tg_load_avg is 794. That means
that the other 790 have to come from somewhere, and in this example,
they come from a cfs_rq on another cpu.
Hopefully that clarified a bit.
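
As a rough cross-check (just a sketch, and only meaningful because sched_debug only
prints the cfs_rq's that are still on leaf_cfs_rq_list), one can sum the listed
tg_load_avg_contrib values per task_group and compare them against tg_load_avg; a
big gap means some cfs_rq that is off the list still contributes:

--- bash start
awk '
/^cfs_rq\[/             { split($1, a, ":"); tg = a[2] }   # e.g. cfs_rq[0]:/slice/cg-1
/\.tg_load_avg_contrib/ { contrib[tg] += $3 }
/\.tg_load_avg /        { tg_load[tg] = $3 }
END {
    # values decay and race with the read, so allow some arbitrary slack (100)
    for (tg in tg_load)
        if (tg_load[tg] > contrib[tg] + 100)
            printf "%-40s listed contrib %6d  tg_load_avg %6d\n", tg, contrib[tg], tg_load[tg]
}' /proc/sched_debug
--- bash end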

For reference, here is the output when the issue is not occurring:
cfs_rq[1]:/slice/cg-2/sub
  .load_avg                      : 1024
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1039
  .tg_load_avg                   : 1039
  .se->avg.load_avg              : 1
cfs_rq[1]:/slice/cg-1/sub
  .load_avg                      : 1023
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1034
  .tg_load_avg                   : 1034
  .se->avg.load_avg              : 49994
cfs_rq[1]:/slice/cg-1
  .load_avg                      : 49998
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 49534
  .tg_load_avg                   : 49534
  .se->avg.load_avg              : 1023
cfs_rq[1]:/slice/cg-2
  .load_avg                      : 1
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1
  .tg_load_avg                   : 1
  .se->avg.load_avg              : 1023
cfs_rq[1]:/slice
  .load_avg                      : 2048
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 2021
  .tg_load_avg                   : 2021
  .se->avg.load_avg              : 1023


Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-26 16:33   ` Odin Ugedal
@ 2021-04-27 10:55     ` Vincent Guittot
  2021-04-27 11:24       ` Odin Ugedal
  0 siblings, 1 reply; 15+ messages in thread
From: Vincent Guittot @ 2021-04-27 10:55 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	linux-kernel

On Mon, 26 Apr 2021 at 18:33, Odin Ugedal <odin@ugedal.com> wrote:
>
> Hi,
>
>
> > Have you been able to reproduce this on mainline ?
>
> Yes. I have been debugging and testing with v5.12-rc8. After I found
> the suspected
> commit in ~v4.8, I compiled both the v4.4.267 and v4.9.267, and was able to
> successfully reproduce it on v4.9.267 and not on v4.4.267. It is also
> reproducible
> on 5.11.16-arch1-1 that my distro ships, and it is reproducible on all
> the machines
> I have tested.
>
> > When running the script below on v5.12, I'm not able to reproduce your problem
>
> v5.12 is pretty fresh, so I have not tested on anything before v5.12-rc8. I did
> compile v5.12.0 now, and I am able to reproduce it there as well.

I wanted to say any v5.12-rcX version, to make sure this is still a
valid problem on the latest version

>
> Which version did you try (the one for cgroup v1 or v2)? And/or did you try
> to run the inspection bpftrace script? If you tested the cg v1
> version, it will often
> end up at 50/50, 51/49 etc., and sometimes 60/40+-, making it hard to
> verify without inspection.

I tried cgroup v1 and v2 but not the bpf script

>
> I have attached a version of the "sub cgroup" example for cgroup v1,
> that also force
> the process to start on cpu 1 (CPU_ME), and sends it over to cpu 0
> (CPU) after attaching
> to the new cgroup. That will make it evident each time. This example should also
> always end up with 50/50 per stress process, but "always" ends up more
> like 99/1.
>
> Can you confirm if you are able to reproduce with this version?

I confirm that I can see a ratio of 4ms vs 204ms running time with the
patch below. But when I look more deeply in my trace (I have
instrumented the code), it seems that the 2 stress-ng processes don't belong
to the same cgroup but remained in cg-1 and cg-2, which explains such a
running time difference. So your script doesn't reproduce the bug you
want to highlight. That being said, I can also see a diff between the
contrib of cpu0 in the tg_load. I'm going to look further

>
> --- bash start
> CGROUP_CPU=/sys/fs/cgroup/cpu/slice
> CGROUP_CPUSET=/sys/fs/cgroup/cpuset/slice
> CGROUP_CPUSET_ME=/sys/fs/cgroup/cpuset/me
> CPU=0
> CPU_ME=1
>
> function run_sandbox {
>   local CG_CPUSET="$1"
>   local CG_CPU="$2"
>   local INNER_SHARES="$3"
>   local CMD="$4"
>
>   local PIPE="$(mktemp -u)"
>   mkfifo "$PIPE"
>   sh -c "read < $PIPE ; exec $CMD" &
>   local TASK="$!"
>   sleep .1
>   mkdir -p "$CG_CPUSET"
>   mkdir -p "$CG_CPU"/sub
>   tee "$CG_CPU"/sub/cgroup.procs <<< "$TASK"
>   tee "$CG_CPU"/sub/cpu.shares <<< "$INNER_SHARES"
>
>   tee "$CG_CPUSET"/cgroup.procs <<< "$TASK"
>
>   tee "$PIPE" <<< sandox_done
>   rm "$PIPE"
> }
>
> mkdir -p "$CGROUP_CPU"
> mkdir -p "$CGROUP_CPUSET"
> mkdir -p "$CGROUP_CPUSET_ME"
>
> tee "$CGROUP_CPUSET"/cpuset.cpus <<< "$CPU"
> tee "$CGROUP_CPUSET"/cpuset.mems <<< "$CPU"
>
> tee "$CGROUP_CPUSET_ME"/cpuset.cpus <<< "$CPU_ME"
> echo $$ | tee "$CGROUP_CPUSET_ME"/cgroup.procs
>
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-1" 50000 "stress --cpu 1"
> run_sandbox "$CGROUP_CPUSET" "$CGROUP_CPU/cg-2" 2     "stress --cpu 1"
>
> read # click enter to cleanup and stop all stress procs
> killall stress
> sleep .2
> rmdir /sys/fs/cgroup/cpuset/slice/
> rmdir /sys/fs/cgroup/cpu/slice/{cg-{1,2}{/sub,},}
> --- bash end
>
>
> Thanks
> Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-27 10:55     ` Vincent Guittot
@ 2021-04-27 11:24       ` Odin Ugedal
  2021-04-27 12:44         ` Vincent Guittot
  0 siblings, 1 reply; 15+ messages in thread
From: Odin Ugedal @ 2021-04-27 11:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	linux-kernel

Hi,

> I wanted to say one v5.12-rcX version to make sure this is still a
> valid problem on latest version

Ahh, I see. No problem. :) Thank you so much for taking the time to
look at this!

> I confirm that I can see a ratio of 4ms vs 204ms running time with the
> patch below.

(I assume you talk about the bash code for reproducing, not the actual
sched patch.)

> But when I look more deeply in my trace (I have
> instrumented the code), it seems that the 2 stress-ng don't belong to
> the same cgroup but remained in cg-1 and cg-2 which explains such
> running time difference.

(my second reply to your previous mail might also help surface it)

Not sure if I have stated it correctly, or if we are talking about the
same thing. It _is_ the intention that the two procs should not be in the
same cgroup. In the same way as people create "containers", each proc runs
in a separate cgroup in this example. The issue is not the balancing between
the procs themselves, but rather between the cgroups/sched_entities inside
the cgroup hierarchy (due to the fact that the vruntime of those
sched_entities ends up being calculated with more load than it is
supposed to).

If you have any thoughts about the phrasing of the patch itself to make it
easier to understand, feel free to suggest them.

Given the last cgroup v1 script, I get this:

- cat /proc/<stress-pid-1>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-1/sub
3:cpuset:/slice

- cat /proc/<stress-pid-2>/cgroup | grep cpu
11:cpu,cpuacct:/slice/cg-2/sub
3:cpuset:/slice


The cgroup hierarchy will then roughly be like this (using cgroup v2 terms,
because I find them easier to reason about):

slice/
  cg-1/
    cpu.weight: 100
    sub/
      cpu.weight: 1
      cpuset.cpus: 1
      cgroup.procs - stress process 1 here
  cg-2/
    cpu.weight: 100
    sub/
      cpu.weight: 10000
      cpuset.cpus: 1
      cgroup.procs - stress process 2 here

This should result in 50/50 due to the fact that cg-1 and cg-2 both have a
weight of 100 and "live" inside the /slice cgroup. The inner weights should not
matter, since each sub/ group is the only cgroup at its level.
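
(For reference, the math: at the /slice level each of cg-1 and cg-2 gets
100 / (100 + 100) = 50% of the cpu, and since each sub/ group has no siblings,
it gets 100% of whatever its parent gets, no matter if its own weight is 1 or 10000.)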

> So your script doesn't reproduce the bug you
> want to highlight. That being said, I can also see a diff between the
> contrib of the cpu0 in the tg_load. I'm going to look further

There can definitely be some other issues involved, and I am pretty sure
you have way more knowledge about the scheduler than me... :) However,
I am pretty sure that it is in fact showing the issue I am talking about,
and applying the patch does indeed make it impossible to reproduce it
on my systems.

Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-27 11:24       ` Odin Ugedal
@ 2021-04-27 12:44         ` Vincent Guittot
  0 siblings, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2021-04-27 12:44 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	linux-kernel

On Tue, 27 Apr 2021 at 13:24, Odin Ugedal <odin@ugedal.com> wrote:
>
> Hi,
>
> > I wanted to say one v5.12-rcX version to make sure this is still a
> > valid problem on latest version
>
> Ahh, I see. No problem. :) Thank you so much for taking the time to
> look at this!
>
> > I confirm that I can see a ratio of 4ms vs 204ms running time with the
> > patch below.
>
> (I assume you talk about the bash code for reproducing, not the actual
> sched patch.)

yes sorry

>
> > But when I look more deeply in my trace (I have
> > instrumented the code), it seems that the 2 stress-ng don't belong to
> > the same cgroup but remained in cg-1 and cg-2 which explains such
> > running time difference.
>
> (mail reply number two to your previous mail might also help surface it)
>
> Not sure if I have stated it correctly, or if we are talking about the
> same thing. It _is_ the intention that the two procs should not be in the
> same cgroup. In the same way as people create "containers", each proc runs
> in a separate cgroup in the example. The issue is not the balancing
> between the procs
> themselves, but rather cgroups/sched_entities inside the cgroup hierarchy.
> (due to the fact that the vruntime of those sched_entities end up
> being calculated with more load than they are supposed to).
>
> If you have any thought about the phrasing of the patch itself to make it
> easier to understand, feel free to suggest.
>
> Given the last cgroup v1 script, I get this:
>
> - cat /proc/<stress-pid-1>/cgroup | grep cpu
> 11:cpu,cpuacct:/slice/cg-1/sub
> 3:cpuset:/slice
>
> - cat /proc/<stress-pid-2>/cgroup | grep cpu
> 11:cpu,cpuacct:/slice/cg-2/sub
> 3:cpuset:/slice
>
>
> The cgroup hierarchy will then roughly be like this (using cgroup v2 terms,
> becuase I find them easier to reason about):
>
> slice/
>   cg-1/
>     cpu.shares: 100
>     sub/
>       cpu.weight: 1
>       cpuset.cpus: 1
>       cgroup.procs - stress process 1 here
>   cg-2/
>     cpu.weight: 100
>     sub/
>       cpu.weight: 10000
>       cpuset.cpus: 1
>       cgroup.procs - stress process 2 here
>
> This should result in 50/50 due to the fact that cg-1 and cg-2 both have a
> weight of 100, and "live" inside the /slice cgroup. The inner weight should not
> matter, since there is only one cgroup at that level.
>
> > So your script doesn't reproduce the bug you
> > want to highlight. That being said, I can also see a diff between the
> > contrib of the cpu0 in the tg_load. I'm going to look further
>
> There can definitely be some other issues involved, and I am pretty sure
> you have way more knowledge about the scheduler than me... :) However,
> I am pretty sure that it is in fact showing the issue I am talking about,
> and applying the patch does indeed make it impossible to reproduce it
> on my systems.

Your script is correct. I was wrongly interpreting my trace. I have
been able to reproduce your problem and your analysis is correct. Let
me continue on the patch itself

>
> Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-25  8:09 ` [PATCH 1/1] " Odin Ugedal
@ 2021-04-27 14:26   ` Vincent Guittot
  2021-04-28 13:10     ` Odin Ugedal
  0 siblings, 1 reply; 15+ messages in thread
From: Vincent Guittot @ 2021-04-27 14:26 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel

Le dimanche 25 avril 2021 à 10:09:02 (+0200), Odin Ugedal a écrit :
> This fixes an issue where old load on a cfs_rq is not properly decayed,
> resulting in strange behavior where fairness can decrease drastically.
> Real workloads with equally weighted control groups have ended up
> getting a respective 99% and 1%(!!) of cpu time.
> 
> When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
> the old load of the task is attached to the new cfs_rq and sched_entity by
> attach_entity_cfs_rq. If the task is then moved to another cpu (and
> therefore cfs_rq) before being enqueued/woken up, the load will be moved
> to cfs_rq->removed from the sched_entity. Such a move will happen when
> enforcing a cpuset on the task (eg. via a cgroup) that force it to move.

It would be good to mention that the problem happens only if the new cfs_rq has
been removed from the leaf_cfs_rq_list because its PELT metrics were already
null. In such a case, __update_blocked_fair() never updates the blocked load of
the new cfs_rq and never propagates the removed load up the hierarchy.

> 
> The load will however not be removed from the task_group itself, making
> it look like there is a constant load on that cfs_rq. This causes the
> vruntime of tasks on other sibling cfs_rq's to increase faster than they
> are supposed to; causing severe fairness issues. If no other task is
> started on the given cfs_rq, and due to the cpuset it would not happen,
> this load would never be properly unloaded. With this patch the load
> will be properly removed inside update_blocked_averages. This also
> applies to tasks moved to the fair scheduling class and moved to another
> cpu, and this path will also fix that. For fork, the entity is queued
> right away, so this problem does not affect that.
> 
> For a simple cgroup hierarchy (as seen below) with two equally weighted
> groups, that in theory should get 50/50 of cpu time each, it often leads
> to a load of 60/40 or 70/30.
> 
> parent/
>   cg-1/
>     cpu.weight: 100
>     cpuset.cpus: 1
>   cg-2/
>     cpu.weight: 100
>     cpuset.cpus: 1
> 
> If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
> equally weighted, they should still get a 50/50 balance of cpu time.
> This however sometimes results in a balance of 10/90 or 1/99(!!) between
> the task groups.
> 
> $ ps u -C stress
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
> root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1
> 
> parent/
>   cg-1/
>     cpu.weight: 100
>     sub-group/
>       cpu.weight: 1
>       cpuset.cpus: 1
>   cg-2/
>     cpu.weight: 100
>     sub-group/
>       cpu.weight: 10000
>       cpuset.cpus: 1
> 
> This can be reproduced by attaching an idle process to a cgroup and
> moving it to a given cpuset before it wakes up. The issue is evident in
> many (if not most) container runtimes, and has been reproduced
> with both crun and runc (and therefore docker and all its "derivatives"),
> and with both cgroup v1 and v2.
> 
> Fixes: 3d30544f0212 ("sched/fair: Apply more PELT fixes")

The fix tag should be :
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")

This patch re-introduced the removal of idle cfs_rq's from leaf_cfs_rq_list in order to
skip useless updates of the blocked load.


> Signed-off-by: Odin Ugedal <odin@uged.al>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..ad7556f99b4a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10916,6 +10916,19 @@ static void attach_task_cfs_rq(struct task_struct *p)
>  
>  	if (!vruntime_normalized(p))
>  		se->vruntime += cfs_rq->min_vruntime;
> +
> +	/*
> +	 * Make sure the attached load will decay properly
> +	 * in case the task is moved to another cpu before
> +	 * being queued.
> +	 */
> +	if (!task_on_rq_queued(p)) {
> +		for_each_sched_entity(se) {
> +			if (se->on_rq)
> +				break;
> +			list_add_leaf_cfs_rq(cfs_rq_of(se));
> +		}
> +	}

propagate_entity_cfs_rq() already goes across the tg tree to
propagate the attach/detach.

It would be better to call list_add_leaf_cfs_rq(cfs_rq) inside this function
instead of looping over the tg tree twice. Something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33b1ee31ae0f..18441ce7316c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);

-               if (cfs_rq_throttled(cfs_rq))
-                       break;
+               if (!cfs_rq_throttled(cfs_rq))
+                       update_load_avg(cfs_rq, se, UPDATE_TG);

-               update_load_avg(cfs_rq, se, UPDATE_TG);
+               list_add_leaf_cfs_rq(cfs_rq);
        }
 }
 #else


>
>  }
>  
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
> -- 
> 2.31.1
> 

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-27 14:26   ` Vincent Guittot
@ 2021-04-28 13:10     ` Odin Ugedal
  2021-04-28 15:35       ` Vincent Guittot
  0 siblings, 1 reply; 15+ messages in thread
From: Odin Ugedal @ 2021-04-28 13:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

Hi,

> Would be good to mention that the problem happens only if the new cfs_rq has
> been removed from the leaf_cfs_rq_list because its PELT metrics were already
> null. In such case __update_blocked_fair() never updates the blocked load of
> the new cfs_rq and never propagate the removed load in the hierarchy.

Well, it does technically occur when the PELT metrics are null and the cfs_rq has
therefore been removed from the leaf_cfs_rq_list, that is correct. We do however not
add newly created cfs_rq's to the leaf_cfs_rq_list at all, so that is also a reason
for it to occur. Most users of cgroups are probably creating a new cgroup and then
attaching a process to it, so I think that will be the _biggest_ issue.

> The fix tag should be :
> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
>
> This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
> skip useless update of blocked load.

Thanks for pointing me at that patch! A quick look makes me think that that
commit caused the issue to occur _more often_, but was not the one that
introduced it. I should probably investigate a bit more though, since I didn't
dig that deep into it. That patch does not revert cleanly on v5.12,
but I did apply the diff below to test; it essentially reverts what
039ae8bcf7a5 does, as far as I can see. There might however have been more
commits between those, so I might take a look further back to see.

Doing this does make the problem less severe, resulting in a ~90/10 load split on
the example that without the diff results in ~99/1. So even with this diff/with
039ae8bcf7a5 reverted, there is still an issue.

Should I keep both "Fixes" tags, or should I just keep one of them?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..5fac4fbf6f84 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
                 * There can be a lot of idle CPU cgroups.  Don't let fully
                 * decayed cfs_rqs linger on the list.
                 */
-               if (cfs_rq_is_decayed(cfs_rq))
-                       list_del_leaf_cfs_rq(cfs_rq);
+               // if (cfs_rq_is_decayed(cfs_rq))
+               //      list_del_leaf_cfs_rq(cfs_rq);

                /* Don't need periodic decay once load/util_avg are null */
                if (cfs_rq_has_blocked(cfs_rq))

> propagate_entity_cfs_rq() already goes across the tg tree to
> propagate the attach/detach.
>
> would be better to call list_add_leaf_cfs_rq(cfs_rq)  inside this function
> instead of looping twice the tg tree. Something like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33b1ee31ae0f..18441ce7316c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>
> -               if (cfs_rq_throttled(cfs_rq))
> -                       break;
> +               if (!cfs_rq_throttled(cfs_rq))
> +                       update_load_avg(cfs_rq, se, UPDATE_TG);
>
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> +               list_add_leaf_cfs_rq(cfs_rq);
>         }
>  }
>  #else


Thanks for that feedback!

I did think about that, but was not sure which approach would be best.
If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in
more places than just on cgroup change and move to the fair class), I do agree
that that is a better solution. I will test that, and post a new patch
if it works as expected.

Also, the current code will exit from the loop in case a cfs_rq is throttled,
while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
(and required), but should we keep running update_load_avg? I do think it is ok,
and the likelihood of a cfs_rq being throttled is not that high after all, so
I guess it doesn't really matter.

Thanks
Odin

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-28 13:10     ` Odin Ugedal
@ 2021-04-28 15:35       ` Vincent Guittot
  2021-04-28 15:50         ` Dietmar Eggemann
  2021-05-01 14:33         ` Odin Ugedal
  0 siblings, 2 replies; 15+ messages in thread
From: Vincent Guittot @ 2021-04-28 15:35 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

On Wed, 28 Apr 2021 at 15:10, Odin Ugedal <odin@ugedal.com> wrote:
>
> Hi,
>
> > Would be good to mention that the problem happens only if the new cfs_rq has
> > been removed from the leaf_cfs_rq_list because its PELT metrics were already
> > null. In such case __update_blocked_fair() never updates the blocked load of
> > the new cfs_rq and never propagate the removed load in the hierarchy.
>
> Well, it does technically occur when PELT metrics were null and therefore
> removed from this leaf_cfs_rq_list, that is correct. We do however not add
> newly created cfs_rq's to leaf_cfs_rq_list, so that is also a reason for it

You're right that we wait for the 1st task to be enqueued to add the
cfs_rq in the list
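
For reference, that is the first-enqueue path; roughly something like this
(from my reading of enqueue_entity() around v5.12, so the exact context may
differ):

		/* on the first enqueue, link the cfs_rq into the leaf list */
		if (cfs_rq->nr_running == 1)
			list_add_leaf_cfs_rq(cfs_rq);

so a cfs_rq that has never had a runnable task is simply never linked.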

> to occur. Most users of cgroups are probably creating a new cgroup and then
> attaching a process to it, so I think that will be the _biggest_ issue.

Yes,  I agree that according to your sequence, your problem mainly
comes from this and not the commit below

>
> > The fix tag should be :
> > Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> >
> > This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
> > skip useless update of blocked load.
>
> Thanks for pointing me at that patch! A quick look makes me think that that
> commit caused the issue to occur _more often_, but was not the one that
> introduced it. I should probably investigate a bit more though, since I didn't
> dig that deep into it. The patch does not revert cleanly on v5.12, but I did
> apply the diff below to test; as far as I can see, it essentially reverts what
> 039ae8bcf7a5 does. There might however be more relevant commits between those,
> so I might look further back to check.
>
> Doing this does make the problem less severe, resulting in ~90/10 load on the
> example that without the diff results in ~99/1. So with this diff/reverting
> 039ae8bcf7a5, there is still an issue.
>
> Should I keep two "Fixes", or should I just take one of them?

You can keep both fixes tags

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..5fac4fbf6f84 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>                  * decayed cfs_rqs linger on the list.
>                  */
> -               if (cfs_rq_is_decayed(cfs_rq))
> -                       list_del_leaf_cfs_rq(cfs_rq);
> +               // if (cfs_rq_is_decayed(cfs_rq))
> +               //      list_del_leaf_cfs_rq(cfs_rq);
>
>                 /* Don't need periodic decay once load/util_avg are null */
>                 if (cfs_rq_has_blocked(cfs_rq))
>
> > propagate_entity_cfs_rq() already goes across the tg tree to
> > propagate the attach/detach.
> >
> > would be better to call list_add_leaf_cfs_rq(cfs_rq)  inside this function
> > instead of looping twice the tg tree. Something like:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 33b1ee31ae0f..18441ce7316c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> >         for_each_sched_entity(se) {
> >                 cfs_rq = cfs_rq_of(se);
> >
> > -               if (cfs_rq_throttled(cfs_rq))
> > -                       break;
> > +               if (!cfs_rq_throttled(cfs_rq))
> > +                       update_load_avg(cfs_rq, se, UPDATE_TG);
> >
> > -               update_load_avg(cfs_rq, se, UPDATE_TG);
> > +               list_add_leaf_cfs_rq(cfs_rq);
> >         }
> >  }
> >  #else
>
>
> Thanks for that feedback!
>
> I did think about that, but was not sure what would be the best one.
> If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in

If the cfs_rq is already in the list, list_add_leaf_cfs_rq() will exit
early, but if it's not, we have to make sure that the whole branch is
added to the list.

In fact, we can break as soon as list_add_leaf_cfs_rq() and
cfs_rq_throttled() return true
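
i.e. something like this at the end of the loop body (just a sketch;
list_add_leaf_cfs_rq() returns true once the whole branch is on the list):

                if (list_add_leaf_cfs_rq(cfs_rq) && cfs_rq_throttled(cfs_rq))
                        break;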

> more places than just on cgroup change and move to fair class), I do agree
> that that is a better solution. Will test that, and post a new patch
> if it works as expected.
>
> Also, the current code will exit from the loop in case a cfs_rq is throttled,
> while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
> (and required), but should we keep running update_load_avg? I do think it is ok,

When a cfs_rq is throttled, it is not accounted in its parent anymore
so we don't have to update and propagate the load down.

> and the likelihood of a cfs_rq being throttled is not that high after all, so
> I guess it doesn't really matter.
>
> Thanks
> Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-28 15:35       ` Vincent Guittot
@ 2021-04-28 15:50         ` Dietmar Eggemann
  2021-05-01 14:41           ` Odin Ugedal
  2021-05-01 14:33         ` Odin Ugedal
  1 sibling, 1 reply; 15+ messages in thread
From: Dietmar Eggemann @ 2021-04-28 15:50 UTC (permalink / raw)
  To: Vincent Guittot, Odin Ugedal
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

On 28/04/2021 17:35, Vincent Guittot wrote:
> On Wed, 28 Apr 2021 at 15:10, Odin Ugedal <odin@ugedal.com> wrote:
>>
>> Hi,
>>
>>> Would be good to mention that the problem happens only if the new cfs_rq has
>>> been removed from the leaf_cfs_rq_list because its PELT metrics were already
>>> null. In such case __update_blocked_fair() never updates the blocked load of
>>> the new cfs_rq and never propagate the removed load in the hierarchy.
>>
>> Well, it does technically occur when PELT metrics were null and therefore
>> removed from this leaf_cfs_rq_list, that is correct. We do however not add
>> newly created cfs_rq's to leaf_cfs_rq_list, so that is also a reason for it
> 
> You're right that we wait for the 1st task to be enqueued to add the
> cfs_rq in the list
> 
>> to occur. Most users of cgroups are probably creating a new cgroup and then
>> attaching a process to it, so I think that will be the _biggest_ issue.
> 
> Yes,  I agree that according to your sequence, your problem mainly
> comes from this and not the commit below
> 
>>
>>> The fix tag should be :
>>> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
>>>
>>> This patch re-introduced the del of idle cfs_rq from leaf_cfs_rq_list in order to
>>> skip useless update of blocked load.
>>
>> Thanks for pointing me at that patch! A quick look makes me think that that
>> commit caused the issue to occur _more often_, but was not the one that
>> introduced it. I should probably investigate a bit more though, since I didn't
>> dig that deep into it. The patch does not revert cleanly on v5.12, but I did
>> apply the diff below to test; as far as I can see, it essentially reverts what
>> 039ae8bcf7a5 does. There might however be more relevant commits between those,
>> so I might look further back to check.
>>
>> Doing this does make the problem less severe, resulting in ~90/10 load on the
>> example that without the diff results in ~99/1. So with this diff/reverting
>> 039ae8bcf7a5, there is still an issue.
>>
>> Should I keep two "Fixes", or should I just take one of them?
> 
> You can keep both fixes tags
> 
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 794c2cb945f8..5fac4fbf6f84 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7941,8 +7941,8 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
>>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>>                  * decayed cfs_rqs linger on the list.
>>                  */
>> -               if (cfs_rq_is_decayed(cfs_rq))
>> -                       list_del_leaf_cfs_rq(cfs_rq);
>> +               // if (cfs_rq_is_decayed(cfs_rq))
>> +               //      list_del_leaf_cfs_rq(cfs_rq);
>>
>>                 /* Don't need periodic decay once load/util_avg are null */
>>                 if (cfs_rq_has_blocked(cfs_rq))
>>
>>> propagate_entity_cfs_rq() already goes across the tg tree to
>>> propagate the attach/detach.
>>>
>>> would be better to call list_add_leaf_cfs_rq(cfs_rq)  inside this function
>>> instead of looping twice the tg tree. Something like:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 33b1ee31ae0f..18441ce7316c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>>>         for_each_sched_entity(se) {
>>>                 cfs_rq = cfs_rq_of(se);
>>>
>>> -               if (cfs_rq_throttled(cfs_rq))
>>> -                       break;
>>> +               if (!cfs_rq_throttled(cfs_rq))
>>> +                       update_load_avg(cfs_rq, se, UPDATE_TG);
>>>
>>> -               update_load_avg(cfs_rq, se, UPDATE_TG);
>>> +               list_add_leaf_cfs_rq(cfs_rq);
>>>         }
>>>  }
>>>  #else
>>
>>
>> Thanks for that feedback!
>>
>> I did think about that, but was not sure what would be the best one.
>> If it is "safe" to always run list_add_leaf_cfs_rq there (since it is used in
> 
> If the cfs_rq is already in the list, list_add_leaf_cfs_rq() will exit
> early, but if it's not, we have to make sure that the whole branch is
> added to the list.
> 
> In fact, we can break as soon as list_add_leaf_cfs_rq() and
> cfs_rq_throttled() return true
> 
>> more places than just on cgroup change and move to fair class), I do agree
>> that that is a better solution. Will test that, and post a new patch
>> if it works as expected.
>>
>> Also, the current code will exit from the loop in case a cfs_rq is throttled,
>> while your suggestion will keep looping. For list_add_leaf_cfs_rq that is fine
>> (and required), but should we keep running update_load_avg? I do think it is ok,
> 
> When a cfs_rq is throttled, it is not accounted in its parent anymore
> so we don't have to update and propagate the load down.
> 
>> and the likelihood of a cfs_rq being throttled is not that high after all, so
>> I guess it doesn't really matter.

I think what I see on my Juno running the unfairness_missing_load_decay.sh script is
in sync with what you discussed here:

[   91.458079] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice/cg-1/sub delta=-769 cfs_rq->tg->load_avg=858->89 [sh 1690]
[   91.474596] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice/cg-1     delta=-32551 cfs_rq->tg->load_avg=32915->364 [ -1]
[   91.492881] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice          delta=-768 cfs_rq->tg->load_avg=776->8 [ -1]

...
[   91.514164]  dump_stack+0xd0/0x12c
[   91.517577]  attach_entity_cfs_rq+0xe4/0xec
[   91.521773]  task_change_group_fair+0x98/0x190      <---- !!!
[   91.526228]  sched_move_task+0x78/0x1e0
[   91.530078]  cpu_cgroup_attach+0x34/0x70
[   91.534013]  cgroup_migrate_execute+0x32c/0x3e4
[   91.538558]  cgroup_migrate+0x78/0x90
[   91.542231]  cgroup_attach_task+0x110/0x11c
[   91.546425]  __cgroup1_procs_write.constprop.0+0x128/0x170
...

[   91.597490] update_tg_load_avg: from=attach_entity_cfs_rq  *cpu2* cgroup=/slice/cg-2/sub delta=28 cfs_rq->tg->load_avg=0->28 [sh 1701]
[   91.609437] update_tg_load_avg: from=update_load_avg_2     cpu2 cgroup=/slice/cg-2     delta=27 cfs_rq->tg->load_avg=0->27 [ -1]
[   91.621033] update_tg_load_avg: from=update_load_avg_2     cpu2 cgroup=/slice          delta=27 cfs_rq->tg->load_avg=8->35 [ -1]
[   91.632763] update_tg_load_avg: from=__update_blocked_fair cpu0 cgroup=/slice/cg-1/sub delta=-7 cfs_rq->tg->load_avg=89->82 [ -1]
[   91.644470] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice          delta=-7 cfs_rq->tg->load_avg=35->28 [ -1]
[   91.656178] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice/cg-1     delta=-355 cfs_rq->tg->load_avg=364->9 [ -1]
[   91.656233] update_tg_load_avg: from=update_load_avg_2     cpu2 cgroup=/slice/cg-2     delta=-27 cfs_rq->tg->load_avg=27->0 [ -1]
[   91.668272] update_tg_load_avg: from=attach_entity_cfs_rq  cpu0 cgroup=/slice/cg-1/sub delta=1024 cfs_rq->tg->load_avg=82->1106 [stress 1706]
[   91.679707] update_tg_load_avg: from=update_load_avg_2     cpu2 cgroup=/slice          delta=-27 cfs_rq->tg->load_avg=28->1 [ -1]
[   91.692419] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice/cg-1     delta=46330 cfs_rq->tg->load_avg=9->46339 [ -1]
[   91.716193] update_tg_load_avg: from=update_load_avg_2     cpu0 cgroup=/slice          delta=1022 cfs_rq->tg->load_avg=1->1023 [ -1]

[   91.749936]  dump_stack+0xd0/0x12c
[   91.753348]  update_load_avg+0x460/0x490
[   91.757283]  enqueue_task_fair+0xe8/0x7fc
[   91.761303]  ttwu_do_activate+0x68/0x160
[   91.765239]  try_to_wake_up+0x1f4/0x594
...

[   91.833275] update_tg_load_avg: from=update_load_avg_1     *cpu0* cgroup=/slice/cg-2/sub delta=28 cfs_rq->tg->load_avg=28->56 [sh 1701]
[   91.845307] list_add_leaf_cfs_rq: cpu0 cgroup=/slice/cg-2/sub
[   91.851068] list_add_leaf_cfs_rq: cpu0 cgroup=/slice/cg-2

*cpu2* cgroup=/slice/cg-2 is not on the leaf_cfs_rq list.


root@juno:~# cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

...
cfs_rq[0]:/slice/cg-2
  .load_avg                      : 1
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1
  .tg_load_avg                   : 89     <--- !!!              
  .se->avg.load_avg              : 21
...

with the fix:

root@juno:~# cat /proc/sched_debug | grep ":/slice" -A 28 | egrep "(:/slice)|load_avg"

...
cfs_rq[0]:/slice/cg-2
  .load_avg                      : 2
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 2
  .tg_load_avg                   : 2     <--- !!!
  .se->avg.load_avg              : 1023
...

Snippet I used:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 794c2cb945f8..7214e6e89820 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
                        break;
 
                update_load_avg(cfs_rq, se, UPDATE_TG);
+               if (!cfs_rq_is_decayed(cfs_rq))
+                       list_add_leaf_cfs_rq(cfs_rq);
        }
 }

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-28 15:35       ` Vincent Guittot
  2021-04-28 15:50         ` Dietmar Eggemann
@ 2021-05-01 14:33         ` Odin Ugedal
  1 sibling, 0 replies; 15+ messages in thread
From: Odin Ugedal @ 2021-05-01 14:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Odin Ugedal, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

ons. 28. apr. 2021 kl. 17:36 skrev Vincent Guittot <vincent.guittot@linaro.org>:
> You can keep both fixes tags

ACK

> If the cfs_rq is already in the list, list_add_leaf_cfs_rq() will exit
> early, but if it's not, we have to make sure that the whole branch is
> added to the list.

Yeah, that's right. Calling list_add_leaf_cfs_rq once "too much" doesn't
hurt after all.

> In fact, we can break as soon as list_add_leaf_cfs_rq() and
> cfs_rq_throttled() return true

ACK, that makes sense.

> When a cfs_rq is throttled, it is not accounted in its parent anymore
> so we don't have to update and propagate the load down.

Okay. Still need to wrap my head around this a bit more, I guess. I have
looked around a bit, and there is actually a similar issue to "this one" for
the case where a throttled cgroup is "moved" via cpuset. It is however way
harder to reproduce, but it is doable, and it _will_ happen in real-life
systems if the timing is "right". I will dig deeper and finish the patch for
that case some time next week (hopefully). I think that however deserves a
separate patchset, so I will come back with that later.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 33b1ee31ae0f..18441ce7316c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11026,10 +11026,10 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>         for_each_sched_entity(se) {
>                 cfs_rq = cfs_rq_of(se);
>
> -               if (cfs_rq_throttled(cfs_rq))
> -                       break;
> +               if (!cfs_rq_throttled(cfs_rq))
> +                       update_load_avg(cfs_rq, se, UPDATE_TG);
>
> -               update_load_avg(cfs_rq, se, UPDATE_TG);
> +               list_add_leaf_cfs_rq(cfs_rq);
>         }
> }

Sent a v2 with something like this now; it exits if
(list_add_leaf_cfs_rq(cfs_rq) && throttled). Since this loop starts at the
parent of the cfs_rq of the supplied se, I added a list_add_leaf_cfs_rq at
the top in order to insert the leaf cfs_rq as well.
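
Roughly like this (just a sketch of the idea, not the exact v2 diff):

static void propagate_entity_cfs_rq(struct sched_entity *se)
{
        struct cfs_rq *cfs_rq;

        /* The loop below starts at the parent, so link se's own cfs_rq first. */
        list_add_leaf_cfs_rq(cfs_rq_of(se));

        /* Start to propagate at parent */
        se = se->parent;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);

                if (!cfs_rq_throttled(cfs_rq))
                        update_load_avg(cfs_rq, se, UPDATE_TG);

                /* done once the whole branch is on the list and throttled */
                if (list_add_leaf_cfs_rq(cfs_rq) && cfs_rq_throttled(cfs_rq))
                        break;
        }
}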

Thanks
Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-04-28 15:50         ` Dietmar Eggemann
@ 2021-05-01 14:41           ` Odin Ugedal
  2021-05-05  9:43             ` Dietmar Eggemann
  0 siblings, 1 reply; 15+ messages in thread
From: Odin Ugedal @ 2021-05-01 14:41 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Odin Ugedal, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

Hi,

> I think what I see on my Juno running the unfairness_missing_load_decay.sh script is
> in sync with what you discussed here:

Thanks for taking a look!

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 794c2cb945f8..7214e6e89820 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>                         break;
>
>                 update_load_avg(cfs_rq, se, UPDATE_TG);
> +               if (!cfs_rq_is_decayed(cfs_rq))
> +                       list_add_leaf_cfs_rq(cfs_rq);
>         }
> }

This might however lead to "loss" at /slice/cg-2/sub and /slice/cg-1/sub in
this particular case, since propagate_entity_cfs_rq skips one cfs_rq by
starting at the parent of the provided se. The loss in that case would not be
as big, but it will still often contribute to some unfairness.


Thanks
Odin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/1] sched/fair: Fix unfairness caused by missing load decay
  2021-05-01 14:41           ` Odin Ugedal
@ 2021-05-05  9:43             ` Dietmar Eggemann
  0 siblings, 0 replies; 15+ messages in thread
From: Dietmar Eggemann @ 2021-05-05  9:43 UTC (permalink / raw)
  To: Odin Ugedal
  Cc: Vincent Guittot, Odin Ugedal, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, open list:CONTROL GROUP (CGROUP),
	open list

On 01/05/2021 16:41, Odin Ugedal wrote:
> Hi,
> 
>> I think what I see on my Juno running the unfairness_missing_load_decay.sh script is
>> in sync with what you discussed here:
> 
> Thanks for taking a look!
> 
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 794c2cb945f8..7214e6e89820 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10854,6 +10854,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>>                         break;
>>
>>                 update_load_avg(cfs_rq, se, UPDATE_TG);
>> +               if (!cfs_rq_is_decayed(cfs_rq))
>> +                       list_add_leaf_cfs_rq(cfs_rq);
>>         }
>> }
> 
> This might however lead to "loss" at /slice/cg-2/sub and /slice/cg-1/sub in
> this particular case, since propagate_entity_cfs_rq skips one cfs_rq by
> starting at the parent of the provided se. The loss in that case would not be
> as big, but it will still often contribute to some unfairness.

Yeah, that's true.

By moving stopped `stress` tasks into

 /sys/fs/cgroup/cpu/slice/cg-{1,2}/sub

and then into

 /sys/fs/cgroup/cpuset/A

whose cpuset.cpus {0-1,4-5} does not contain the CPUs {2,3} the `stress`
tasks are attached to, and then restarting the `stress` tasks, I get:

cfs_rq[1]:/slice/cg-1/sub
  .load_avg                      : 1024
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1024  <---
  .tg_load_avg                   : 2047  <---
  .se->avg.load_avg              : 511
cfs_rq[1]:/slice/cg-1
  .load_avg                      : 512
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 512  <---
  .tg_load_avg                   : 1022 <---
  .se->avg.load_avg              : 512
cfs_rq[1]:/slice
  .load_avg                      : 513
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 513
  .tg_load_avg                   : 1024
  .se->avg.load_avg              : 512
cfs_rq[5]:/slice/cg-1/sub
  .load_avg                      : 1024
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 1023 <---
  .tg_load_avg                   : 2047 <---
  .se->avg.load_avg              : 511
cfs_rq[5]:/slice/cg-1
  .load_avg                      : 512
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 510  <---
  .tg_load_avg                   : 1022 <---
  .se->avg.load_avg              : 511
cfs_rq[5]:/slice
  .load_avg                      : 512
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 511
  .tg_load_avg                   : 1024
  .se->avg.load_avg              : 510

I saw that your v2 patch takes care of that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-05-05  9:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-25  8:09 [PATCH 0/1] sched/fair: Fix unfairness caused by missing load decay Odin Ugedal
2021-04-25  8:09 ` [PATCH 1/1] " Odin Ugedal
2021-04-27 14:26   ` Vincent Guittot
2021-04-28 13:10     ` Odin Ugedal
2021-04-28 15:35       ` Vincent Guittot
2021-04-28 15:50         ` Dietmar Eggemann
2021-05-01 14:41           ` Odin Ugedal
2021-05-05  9:43             ` Dietmar Eggemann
2021-05-01 14:33         ` Odin Ugedal
2021-04-26 14:58 ` [PATCH 0/1] " Vincent Guittot
2021-04-26 16:33   ` Odin Ugedal
2021-04-27 10:55     ` Vincent Guittot
2021-04-27 11:24       ` Odin Ugedal
2021-04-27 12:44         ` Vincent Guittot
2021-04-27  8:36   ` Odin Ugedal
