From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753823AbdHXUcZ (ORCPT <rfc822;w@1wt.eu>);
        Thu, 24 Aug 2017 16:32:25 -0400
Received: from mail-lf0-f48.google.com ([209.85.215.48]:35068 "EHLO
        mail-lf0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753727AbdHXUcW (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Thu, 24 Aug 2017 16:32:22 -0400
MIME-Version: 1.0
In-Reply-To: <20170824095326.4f5c1777@luca>
References: <1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org>
 <20170822142136.3604336e@luca> <CANLsYkxjHvx37+kNK8SKFM3NFz2G1vuzDCDuuQA9N_QvYCJbNg@mail.gmail.com>
 <20170824095326.4f5c1777@luca>
From: Mathieu Poirier <mathieu.poirier@linaro.org>
Date: Thu, 24 Aug 2017 14:32:20 -0600
Message-ID: <CANLsYkzdKbwzqOcqv=ku7FEFRTM6dot48mNfBdRMR0uwOvjcww@mail.gmail.com>
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
To: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
        tj@kernel.org, vbabka@suse.cz, Li Zefan <lizefan@huawei.com>,
        akpm@linux-foundation.org, weiyongjun1@huawei.com,
        Juri Lelli <juri.lelli@arm.com>, Steven Rostedt <rostedt@goodmis.org>,
        Claudio Scordino <claudio@evidence.eu.com>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 24 August 2017 at 01:53, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> >> operations.  When CPUhotplug and some CPUset manipulation take place root
>> >> domains are destroyed and new ones created, losing at the same time DL
>> >> accounting pertaining to utilisation.
>> >
>> > Thanks for looking at this longstanding issue! I am just back from
>> > vacations; in the next days I'll try your patches.
>> > Do you have some kind of scripts for reproducing the issue
>> > automatically? (I see that in the original email Steven described how
>> > to reproduce it manually; I just wonder if anyone already scripted the
>> > test).
>>
>> I didn't bother scripting it since it is so easy to do.  I'm eager to
>> see how things work out on your end.
>
> Ok, so I'll try to reproduce the issue manually as described in Steven's
> original email; I'll run some tests as soon as I finish with some stuff
> that accumulated during vacations.
>
> [...]
>> >> OPEN ISSUE:
>> >>
>> >> Regardless of how we proceed (using existing CPUset list or new ones) we
>> >> need to deal with DL tasks that span more than one root domain,  something
>> >> that will typically happen after a CPUset operation.  For example, if we
>> >> split the number of available CPUs on a system in two CPUsets and then turn
>> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> >> parent CPUset will end up spanning two root domains.
>> >>
>> >> One way to deal with this is to prevent CPUset operations from happening
>> >> when such condition is detected, as enacted in this set.
>> >
>> > I think this is the simplest (if not only?) solution if we want to use
>> > gEDF in each root domain.
>>
>> Global Earliest Deadline First?  Is my interpretation correct?
>
> Right. As far as I understand, the original SCHED_DEADLINE design is to
> partition the CPUs in disjoint sets, and then use global EDF scheduling
> on each one of those sets (this guarantees bounded tardiness, and if
> you run some additional admission tests in user space you can also
> guarantee the hard respect of every deadline).
>
>
>> >> Although simple
>> >> this approach feels brittle and akin to a "whack-a-mole" game.  A better
>> >> and more reliable approach would be to teach the DL scheduler to deal with
>> >> tasks that span multiple root domains, a serious and substantial
>> >> undertaking.
>> >>
>> >> I am sending this as a starting point for discussion.  I would be grateful
>> >> if you could take the time to comment on the approach and most importantly
>> >> provide input on how to deal with the open issue underlined above.
>> >
>> > I suspect that if we want to guarantee bounded tardiness then we have to
>> > go for a solution similar to the one suggested by Tommaso some time ago
>> > (if I remember well):
>> >
>> > if we want to create some "second level cpusets" inside a "parent
>> > cpuset", allowing deadline tasks to be placed inside both the "parent
>> > cpuset" and the "second level cpusets", then we have to subtract the
>> > "second level cpusets" maximum utilizations from the "parent cpuset"
>> > utilization.
>> >
>> > I am not sure how difficult it can be to implement this...
>>
>> Humm...  I am missing some context here.
>
> Or maybe I misunderstood the issue you were seeing (I am no expert on
> cpusets). Is it related to hierarchies of cpusets (with one cpuset
> contained inside another one)?

Having spent a lot of time in the CPUset code, I can understand the confusion.

CPUsets allow the creation of a hierarchy of sets, _seemingly_ creating
overlapping root domains.  Fortunately that isn't the case -
overlapping CPUsets are merged together to create non-overlapping
root domains.  The magic happens in rebuild_sched_domains_locked() [1]
where generate_sched_domains() [2] transforms any CPUset topology into
disjoint domains.
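
As a rough illustration of that merging step (a simplified sketch, not
the algorithm in generate_sched_domains(), which also takes the
'sched_load_balance' flag into account), overlapping CPU lists can be
folded into disjoint groups like this.  The function name is made up
for the example:

```python
def disjoint_domains(cpusets):
    """Merge overlapping CPU sets so that no CPU appears in two results."""
    domains = []
    for cpus in cpusets:
        merged = set(cpus)
        if not merged:
            continue  # an empty cpuset contributes nothing
        remaining = []
        for dom in domains:
            if dom & merged:
                merged |= dom          # overlap: absorb the existing domain
            else:
                remaining.append(dom)  # no overlap: keep it untouched
        remaining.append(merged)
        domains = remaining
    return sorted(domains, key=min)


if __name__ == "__main__":
    # {0,1} and {1,2} share CPU1, so they collapse into a single domain.
    print(disjoint_domains([{0, 1}, {1, 2}, {3}]))
```

Two cpusets that share even one CPU end up in the same group, which is
why a hierarchy of sets never yields overlapping root domains.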

> Can you describe how to reproduce the problematic situation?

Let's start with a 4 CPU system (in this case the Q401c Dragon board)
where patches 1/7 and 2/7 have been applied to a vanilla kernel.  I'm
also using Juri's tools [3,4] as described in Steve's email [5].

root@linaro-developer:/home/linaro# uname -a
Linux linaro-developer 4.13.0-rc5-00012-g98bf1310205e #149 SMP PREEMPT
Thu Aug 24 13:12:39 MDT 2017 aarch64 GNU/Linux
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# cat /sys/devices/system/cpu/online
0-3
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[2]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
root@linaro-developer:/home/linaro#

This checks out as expected.  Now let's create 2 CPUsets and make sure
new root domains are created by setting the 'sched_load_balance' flag
to '0' on the default CPUset.

root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@linaro-developer:/sys/fs/cgroup/cpuset#
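
For anyone who would rather script this step, a minimal sketch follows.
It assumes a cgroup v1 cpuset hierarchy mounted at
/sys/fs/cgroup/cpuset (as in the transcript) and root privileges; the
helper name create_cpusets() is hypothetical and not part of any
existing tool:

```python
from pathlib import Path


def create_cpusets(root, layout, mems="0"):
    """Create one child cpuset per entry in `layout` and disable load
    balancing on the parent, mirroring the manual steps above.

    root   -- path of the parent cpuset directory (assumed cgroup v1)
    layout -- mapping of cpuset name to CPU list, e.g. {"set1": "0,1"}
    mems   -- memory node list written to every child cpuset
    """
    root = Path(root)
    for name, cpus in layout.items():
        cs = root / name
        cs.mkdir(exist_ok=True)
        # Same order as the manual steps: memory nodes first, then CPUs.
        (cs / "cpuset.mems").write_text(mems + "\n")
        (cs / "cpuset.cpus").write_text(cpus + "\n")
    # Turning off load balancing on the parent is what splits the
    # system into separate root domains.
    (root / "cpuset.sched_load_balance").write_text("0\n")
```

On the board this would be called as
create_cpusets("/sys/fs/cgroup/cpuset", {"set1": "0,1", "set2": "2,3"}).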

At this time runqueue0 and runqueue1 point to root domain A while
runqueue2 and runqueue3 point to root domain B (something that can't
be seen without adding more instrumentation).  Newly created tasks can
roam on all the CPUs available:


root@linaro-developer:/home/linaro# ./burn &
[1] 3973
root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
Cpus_allowed: f
root@linaro-developer:/home/linaro#

The above demonstrates that even though we have two CPUsets, new tasks
belong to the "default" CPUset and as such can use all the available
CPUs.  Now let's make task 3973 a DL task:

root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
  dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0                  <------ Problem
  dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0                  <------ Problem
  dl_rq[2]:
  .dl_nr_running                 : 1
  .dl_nr_migratory               : 1
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718        <------ As expected
  dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718        <------ As expected
root@linaro-developer:/home/linaro/jlelli#
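
The figures in this output can be checked by hand.  The kernel stores
bandwidth as a fixed-point ratio, runtime / period scaled by 2^20.  The
sketch below assumes the default RT throttling values (950000 us of
runtime per 1000000 us period) for dl_bw->bw, and assumes the two
values given to 'schedtool -t' act as runtime and period:

```python
BW_SHIFT = 20  # fixed-point scale used for bandwidth values


def to_ratio(period, runtime):
    """Return runtime/period as an integer scaled by 2**BW_SHIFT."""
    return (runtime << BW_SHIFT) // period


if __name__ == "__main__":
    # Per-CPU cap, assuming the default 950000/1000000 RT throttling.
    print(to_ratio(1_000_000, 950_000))  # 996147
    # Utilisation of the task created with '-t 900000:1000000'.
    print(to_ratio(1_000_000, 900_000))  # 943718
```

Both results match the dl_bw->bw and dl_bw->total_bw values reported
in /proc/sched_debug.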

When task 3973 was promoted to a DL task it was running on either CPU2
or CPU3.  The acceptance test was done on root domain B and the task
utilisation added as expected.  But as pointed out above task 3973 can
still be scheduled on CPU0 and CPU1 and that is a problem since the
utilisation hasn't been added there as well.  The task is now spread
over two root domains rather than a single one, as currently expected
by the DL code (note that there are many ways to reproduce this
situation).

In its current form the patchset prevents specific operations from
being carried out if we recognise that a task could end up spanning
more than a single root domain.  But that will break as soon as we
find a new way to create a DL task that spans multiple domains (and I
may not have caught them all either).
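
To make the condition concrete, here is a small user-space sketch of
the check "does this task span more than one root domain?".  It is only
an illustration: the root domain names and spans are supplied by hand
(the kernel does not export them), and both function names are made up
for the example:

```python
def mask_to_cpus(mask_hex):
    """Convert a Cpus_allowed hex mask (e.g. 'f') into a set of CPU ids."""
    mask = int(mask_hex.replace(",", ""), 16)
    return {cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1}


def spanned_domains(allowed, root_domains):
    """Return the names of the root domains that intersect `allowed`."""
    return [name for name, span in root_domains.items() if span & allowed]


if __name__ == "__main__":
    domains = {"A": {0, 1}, "B": {2, 3}}
    allowed = mask_to_cpus("f")  # Cpus_allowed of task 3973
    hit = spanned_domains(allowed, domains)
    print(hit, "-> spans multiple root domains" if len(hit) > 1 else "-> ok")
```

With the mask 'f' from the example both A and B are returned, which is
exactly the situation the DL code does not expect.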

Another way to fix this is to do an acceptance test on all the root
domains of a task.  So above we'd run the acceptance test on root
domains A and B before promoting the task.  Of course we'd also have to
add the utilisation of that task to both root domains.  Although simple
in principle, it goes to the core of the DL scheduler and touches pretty
much every aspect of it, something I'm reluctant to embark on.
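
A toy model of that alternative is sketched below.  It is not kernel
code: the RootDomain class and function names are invented for the
example, and the overflow check is only loosely modelled on the
existing single-domain test (per-CPU cap multiplied by the number of
CPUs in the domain).  The task is admitted only if every root domain
it can run on has room, and its utilisation is then charged to all of
them:

```python
from dataclasses import dataclass


@dataclass
class RootDomain:
    cpus: frozenset    # CPUs covered by this root domain
    bw: int            # per-CPU bandwidth cap (scaled by 2**20)
    total_bw: int = 0  # bandwidth already handed out in this domain


def dl_overflow(rd, new_bw):
    """True if adding new_bw would exceed the capacity of the domain."""
    return rd.bw * len(rd.cpus) < rd.total_bw + new_bw


def admit_on_all(domains, allowed, new_bw):
    """Run the acceptance test on every root domain the task can reach.

    Nothing is charged unless all reachable domains accept the task.
    """
    touched = [rd for rd in domains if rd.cpus & allowed]
    if any(dl_overflow(rd, new_bw) for rd in touched):
        return False
    for rd in touched:
        rd.total_bw += new_bw
    return True


if __name__ == "__main__":
    a = RootDomain(frozenset({0, 1}), bw=996147)
    b = RootDomain(frozenset({2, 3}), bw=996147)
    print(admit_on_all([a, b], {0, 1, 2, 3}, 943718), a.total_bw, b.total_bw)
```

Checking every domain before charging any of them keeps the accounting
consistent when one of the domains rejects the task.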

[1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
[2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
[3]. https://github.com/jlelli/tests.git
[4]. https://github.com/jlelli/schedtool-dl.git
[5]. https://lkml.org/lkml/2016/2/3/966

>
>> Nonetheless the approach I
>> was contemplating was to repeat the current mathematics to all the
>> root domains accessible from a p->cpus_allowed's flag.
>
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
>
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
>
>
>> As such we'd
>> have the same acceptance test but repeated to more than one root
>> domain.  To do that time can be an issue but the real problem I see is
>> related to the current DL code.  It is geared around a single root
>> domain and changing that means meddling in a lot of places.  I had a
>> prototype that was beginning to address that but decided to gather
>> people's opinion before getting in too deep.
>
> I still do not fully understand this (I got the impression that this is
> related to hierarchies of cpusets, but I am not sure if this
> understanding is correct). Maybe an example would help me to understand.

The above should say it all - please get back to me if I haven't
expressed myself clearly.

>
>
>
>                         Thanks,
>                                 Luca