Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: Daniel Bristot de Oliveira <bristot@redhat.com>
To: Luca Abeni <luca.abeni@unitn.it>, linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>, Juri Lelli <juri.lelli@arm.com>,
	Claudio Scordino <claudio@evidence.eu.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Subject: Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE
Date: Tue, 3 Jan 2017 19:58:38 +0100	[thread overview]
Message-ID: <6c4ab4ec-7164-fafe-5efc-990f3cf31269@redhat.com> (raw)
In-Reply-To: <1483097591-3871-1-git-send-email-lucabe72@gmail.com>

On 12/30/2016 12:33 PM, Luca Abeni wrote:
> From: Luca Abeni <luca.abeni@unitn.it>
> 
> Hi all,
> 
> here is a new version of the patchset implementing CPU reclaiming
> (using the GRUB algorithm[1]) for SCHED_DEADLINE.
> Basically, this feature allows SCHED_DEADLINE tasks to consume more
> than their reserved runtime, up to a maximum fraction of the CPU time
> (so that other tasks are left some spare CPU time to execute), if this
> does not break the guarantees of other SCHED_DEADLINE tasks.
> The patchset applies on top of tip/master.
> 
> 
> The implemented CPU reclaiming algorithm is based on tracking the
> utilization U_act of active tasks (first 2 patches), and modifying the
> runtime accounting rule (see patch 0004). The original GRUB algorithm is
> modified as described in [2] to support multiple CPUs (the original
> algorithm only considered one single CPU, this one tracks U_act per
> runqueue) and to leave an "unreclaimable" fraction of CPU time to non
> SCHED_DEADLINE tasks (see patch 0005: the original algorithm can consume
> 100% of the CPU time, starving all the other tasks).
> Patch 0003 uses the newly introduced "inactive timer" (introduced in
> patch 0002) to fix dl_overflow() and __setparam_dl().
> Patch 0006 allows to enable CPU reclaiming only on selected tasks.

Hi,

Today I did some tests in this patch set. Unfortunately, it seems that
there is a problem :-(.

In a four core box, if I dispatch 11 tasks [1] with setup:

  period = 30 ms
  runtime = 10 ms
  flags = 0 (GRUB disabled)

I see this:
------------------------------- HTOP ------------------------------------
  1  [|||||||||||||||||||||92.5%]   Tasks: 128, 259 thr; 14 running
  2  [|||||||||||||||||||||91.0%]   Load average: 4.65 4.66 4.81 
  3  [|||||||||||||||||||||92.5%]   Uptime: 05:12:43
  4  [|||||||||||||||||||||92.5%]
  Mem[|||||||||||||||1.13G/3.78G]
  Swp[                  0K/3.90G]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
16247 root      -101   0  4204   632   564 R 32.4  0.0  2:10.35 d
16249 root	-101   0  4204   624   556 R 32.4  0.0  2:09.80 d
16250 root	-101   0  4204   728   660 R 32.4  0.0  2:09.58 d
16252 root	-101   0  4204   676   608 R 32.4  0.0  2:09.08 d
16253 root	-101   0  4204   636   568 R 32.4  0.0  2:08.85 d
16254 root      -101   0  4204   732   664 R 32.4  0.0  2:08.62 d
16255 root	-101   0  4204   620   556 R 32.4  0.0  2:08.40 d
16257 root	-101   0  4204   708   640 R 32.4  0.0  2:07.98 d
16256 root	-101   0  4204   624   560 R 32.4  0.0  2:08.18 d
16248 root	-101   0  4204   680   612 R 33.0  0.0  2:10.15 d
16251 root	-101   0  4204   676   608 R 33.0  0.0  2:09.34 d
16259 root       20   0  124M  4692  3120 R  1.1  0.1  0:02.82 htop
 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:28.77 gnome-ter
------------------------------- HTOP ------------------------------------

All tasks are using +- the same amount of CPU time, a little bit more
than 30%, as expected. However, if I enable GRUB in the same task set
I get this:

------------------------------- HTOP ------------------------------------
  1  [|||||||||||||||||||||93.8%]   Tasks: 128, 260 thr; 15 running
  2  [|||||||||||||||||||||95.2%]   Load average: 5.13 5.01 4.98 
  3  [|||||||||||||||||||||93.3%]   Uptime: 05:01:02
  4  [|||||||||||||||||||||96.4%]
  Mem[|||||||||||||||1.13G/3.78G]
  Swp[                  0K/3.90G]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14967 root      -101   0  4204   628   564 R 45.8  0.0  1h07:49 g
14962 root	-101   0  4204   728   660 R 45.8  0.0  1h05:06 g
14959 root	-101   0  4204   680   612 R 45.2  0.0  1h07:29 g
14927 root	-101   0  4204   624   556 R 44.6  0.0  1h04:30 g
14928 root	-101   0  4204   656   588 R 31.1  0.0 47:37.21 g
14961 root	-101   0  4204   684   616 R 31.1  0.0 47:19.75 g
14968 root	-101   0  4204   636   568 R 31.1  0.0 46:27.36 g
14960 root	-101   0  4204   684   616 R 23.8  0.0 37:31.06 g
14969 root	-101   0  4204   684   616 R 23.8  0.0 38:11.50 g
14925 root	-101   0  4204   636   568 R 23.8  0.0 37:34.88 g
14926 root	-101   0  4204   684   616 R 23.8  0.0 38:27.37 g
16182 root	 20   0  124M  3972  3212 R  0.6  0.1  0:00.23 htop
  862 root       20   0  264M  5668  4832 S  0.6  0.1  0:03.30 iio-sensor
 2191 bristot    20   0  649M 41312 32048 S  0.0  1.0  0:27.62 gnome-term
  588 root       20   0  257M  121M  120M S  0.0  3.1  0:13.53 systemd-jo
------------------------------- HTOP ------------------------------------

Some tasks start to use more CPU time, while others seems to use less
CPU than it was reserved for them. See the task 14926, it is using
only 23.8 % of the CPU, which is less than its 10/30 reservation.

I traced this task activation and noticed this:

         swapper     0 [003] 14968.332244: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.339294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7050 us (14968.339294 - 14968.332244)

period:  29997 us (14968.362241 - 14968.332244)
         swapper     0 [003] 14968.362241: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.369294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 0.007053 us (14968.369294 = 14968.362241)

period: 29994 us (14968.392235 - 14968.362241)
         swapper     0 [003] 14968.392235: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.399301: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7066 us (14968.399301 - 14968.392235)

period:  30008 us (14968.422243 - 14968.392235)
         swapper     0 [003] 14968.422243: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.429294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7051 us (14968.429294 - 14968.422243)

period:  29995 us (14968.452238 - 14968.422243)
         swapper     0 [003] 14968.452238: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
               g 14926 [003] 14968.459293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7055 us (14968.459293 - 14968.452238)

period:  30055 us (14968.482293 - 14968.452238)
               g 14925 [003] 14968.482293: sched:sched_switch: g:14925 [-1] R ==> g:14926 [-1]
               g 14926 [003] 14968.490293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 8000 us (14968.490293 - 14968.482293)

The task is using less CPU than it was reserved/guaranteed.

After some debugging, it seems that in this case GRUB is also _reducing_
the runtime of the task by making the notion of consumed runtime
be greater than the actual consumed runtime.

You can see this with this code snip:

------------------- %<-------------------

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 93ff400..1abb594 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq);
-	dl_se->runtime -= delta_exec;
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+		u64 new_delta_exec;
+		new_delta_exec = grub_reclaim(delta_exec, rq);
+		if (new_delta_exec > delta_exec)
+			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
+					new_delta_exec,
+					delta_exec,
+					(new_delta_exec - delta_exec));
+		dl_se->runtime -= new_delta_exec;
+	}
+	else {
+		dl_se->runtime -= delta_exec;
+	}
+
+
 
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
--------------------------- >% -------------

It seems to be related to the "sched/deadline: do not reclaim the whole
CPU bandwidth", because the trace_printk message I put starts to appear
when we start to touch this limit, and the (new_delta_exec - delta_exec)
seems to be somehow limited to the non_deadline_bw.

Output with sysctl -w kernel.sched_rt_runtime_us=950000
               g-1984  [001] d.h1  1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
               g-1983  [002] d.h1  1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
               g-1981  [003] d.h1  1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001

Output with sysctl -w kernel.sched_rt_runtime_us=900000
               g-1748  [001] d.h1   418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
               g-1749  [002] d.h1   418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
               g-1748  [001] d.h1   418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996

In the case of fewer tasks, this error appears just in the
dispatch of a new task, stabilizing after some ms. But it
does not stabilize when we are closer to the limit of the rt
runtime.

That is all I could find today. Am I missing something?

[1] http://bristot.me/lkml/d.c

-- Daniel