From: Vineeth Pillai <vineethrp@gmail.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Nishanth Aravamudan <naravamudan@digitalocean.com>,
	JulienDesfossez@google.com,
	Julien Desfossez <jdesfossez@digitalocean.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	mingo@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>,
	linux-kernel@vger.kernel.org, fweisbec@gmail.com,
	Kees Cook <keescook@chromium.org>, Phil Auld <pauld@redhat.com>,
	Aaron Lu <aaron.lwe@gmail.com>,
	Aubrey Li <aubrey.intel@gmail.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Joel Fernandes <joelaf@google.com>, Chen Yu <yu.c.chen@intel.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	chris hyser <chris.hyser@oracle.com>,
	dhaval.giani@gmail.com, "Paul E . McKenney" <paulmck@kernel.org>,
	joshdon@google.com, xii@google.com, haoluo@google.com,
	bsegall@google.com
Subject: Re: [RFC] Design proposal for upstream core-scheduling interface
Date: Mon, 24 Aug 2020 07:32:05 -0400	[thread overview]
Message-ID: <CAOBnfPjG9=XPCYOP6Hau5rZaKAEb4rYHG=5oORJt36X1_nFPOg@mail.gmail.com> (raw)
In-Reply-To: <20200822030155.GA414063@google.com>

> Let me know your thoughts and looking forward to a good LPC MC discussion!
>

Nice write up Joel, thanks for taking time to compile this with great detail!

After going through the details of the interface proposal using cgroup v2
controllers, and based on our discussion offline, I would like to note down
this idea about a new pseudo filesystem interface for core scheduling. We
could include this as well in the API discussion during the core scheduling MC.

coreschedfs: pseudo filesystem interface for Core Scheduling
----------------------------------------------------------------------------------

The basic requirement of core scheduling is simple: we need to group a set
of tasks into a trust group that can share a core. So we don't really need a
nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
hierarchy model, which causes considerable confusion if the trusted tasks are
at different levels of the hierarchy and we need to allow them to share the
core. Cgroup v2's single-hierarchy model makes it difficult to regroup tasks
at different nesting levels for core scheduling. As noted in this mail, we
could use a multi-file approach and other interfaces like prctl(2) to
overcome this limitation.

The idea proposed here to overcome the above limitation is to come up with a
new pseudo filesystem: "coreschedfs". This filesystem is basically a flat
filesystem with a maximum nesting level of 1. That is, the root directory can
have sub-directories for sub-groups, but those sub-directories cannot have
further sub-directories representing trust groups. The root directory
represents the system-wide trust group, and the sub-directories represent
trusted groups. Each directory, including the root directory, has the
following set of files/directories:

- cookie_id: User-exposed id for a cookie. This can be compared to a
             file descriptor. It could be used in a programmatic API to
             join/leave a group.

- properties: An interface to specify how child tasks of this group
              should behave. It can also carry future flag requirements.
              Current list of properties:
              NEW_COOKIE_FOR_CHILD: every fork() by a task in this group
                                    results in the creation of a new
                                    trust group
              SAME_COOKIE_FOR_CHILD: every fork() by a task in this group
                                     ends up in this same group
              ROOT_COOKIE_FOR_CHILD: every fork() by a task in this group
                                     goes to the root group

- tasks: Lists the tasks in this group. Main interface for adding and
         removing tasks in a group.

- <pid>: A directory per task that is a member of this trust group.
- <pid>/properties: Same as the parent properties file, but it overrides
                    the group setting for this task.

This pseudo filesystem can be mounted anywhere in the root filesystem; I
propose the default to be "/sys/kernel/coresched".

When core scheduling is enabled, the kernel internally creates the framework
for this filesystem. The filesystem gets mounted at the default location, and
the admin can change this if needed. All tasks are in the root group by
default. The admin or programs can then create trusted groups on top of this
filesystem.
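As a sketch of that admin workflow, using the file layout above: the group
name and pid below are made up, and since coreschedfs does not exist yet, a
scratch directory stands in for the proposed /sys/kernel/coresched mount
point (with the real filesystem, the kernel would create the control files
on mkdir; here touch simulates that).

```shell
# Stand-in for /sys/kernel/coresched (assumption: real fs not mounted).
COREDIR="${COREDIR:-$(mktemp -d)}"

# Create a trust group for a hypothetical job.
mkdir "$COREDIR/job1"
# The kernel would populate these on mkdir; simulated here.
touch "$COREDIR/job1/cookie_id" "$COREDIR/job1/properties" "$COREDIR/job1/tasks"

# Children forked by tasks in this group stay in the same group.
echo SAME_COOKIE_FOR_CHILD > "$COREDIR/job1/properties"

# Move a task (pid 1234, hypothetical) into the group via the tasks file.
echo 1234 >> "$COREDIR/job1/tasks"

cat "$COREDIR/job1/properties" "$COREDIR/job1/tasks"
```

Tearing a group down would go through the same standard calls (rmdir on the
group directory, writes to the tasks file), which is what makes shell-level
administration possible.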

Hooks will be placed in fork() and exit() to make sure that the filesystem's
view of tasks stays up-to-date with the system. APIs that manipulate core
scheduling trusted groups should also keep the filesystem's view updated.

Note: The above idea is very similar to cgroup v1. Since there is no unified
hierarchy in cgroup v1, most of the features of coreschedfs could be
implemented as a cgroup v1 controller. But as no new v1 controllers are
allowed, I feel the best alternative for a simple API is to come up with a
new filesystem: coreschedfs.

The advantages of this approach are:

- Detached from the cgroup unified hierarchy, so the very simple requirement
  of core scheduling can be easily materialized.
- The admin can have fine-grained control of groups using the shell and
  scripting.
- Programmatic access through existing APIs like mkdir, rmdir, write and
  read. Or new APIs could be built around the cookie_id, either wrapping the
  above Linux APIs or using a new system call for core scheduling.
- Fine-grained permission control using Linux filesystem permissions and ACLs.

Disadvantages are:
- Yet another pseudo filesystem.
- Very similar to cgroup v1, so it might re-implement features that are
  already provided by cgroup v1.

Use Cases
-----------------

Usecase 1: Google cloud
---------------------------------

Since we no longer depend on cgroup v2 hierarchies, there will not be any
issue of nesting and sharing. The main daemon can create trusted groups in
the filesystem and grant the required permissions on each group. The init
process of each job can then be added to its respective group so it can
create child tasks as needed. Multiple jobs under the same customer that
need to share a core can be housed in one group.


Usecase 2: Chrome browser
------------------------

We start with one group for the first task and then set properties to
NEW_COOKIE_FOR_CHILD.
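Concretely, that could look like the following sketch: the group name and
pid are made up, and a scratch directory again stands in for the proposed
mount point since the filesystem does not exist yet.

```shell
COREDIR="${COREDIR:-$(mktemp -d)}"   # stand-in for /sys/kernel/coresched

# One group for the first browser task; "browser" is a hypothetical name.
mkdir "$COREDIR/browser"
touch "$COREDIR/browser/properties" "$COREDIR/browser/tasks"

# Every fork() from this group would create a new trust group, so each
# child (e.g. a renderer) lands in its own group automatically.
echo NEW_COOKIE_FOR_CHILD > "$COREDIR/browser/properties"
echo 4321 >> "$COREDIR/browser/tasks"   # hypothetical browser pid
```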

Usecase 3: chrome VMs
---------------------

Similar to the Chrome browser, the VM task can put each vCPU in its own group.

Usecase 4: Oracle use case
--------------------------
This is also similar to use case 1 with this interface. All tasks that need to
be in the root group can be easily added by the admin.

Use case 5: General virtualization
----------------------------------

The requirement is that each VM should be isolated. This can easily be done
by creating a new group per VM.
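A sketch of that, with made-up VM names and, as before, a scratch directory
standing in for the proposed mount point:

```shell
COREDIR="${COREDIR:-$(mktemp -d)}"   # stand-in for /sys/kernel/coresched

# One trust group per VM: tasks of the same VM share a group (and hence
# a cookie), so different VMs never share a core.
for vm in vm-a vm-b vm-c; do
    mkdir "$COREDIR/$vm"
    # the kernel would create these control files on mkdir
    touch "$COREDIR/$vm/cookie_id" "$COREDIR/$vm/properties" "$COREDIR/$vm/tasks"
done

ls "$COREDIR"
```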


Please have a look at the above proposal and let us know your thoughts. We
shall also include this during the interface discussion at the core
scheduling MC.


Thanks,
Vineeth


Thread overview: 7+ messages
2020-08-22  3:01 [RFC] Design proposal for upstream core-scheduling interface Joel Fernandes
2020-08-24 11:32 ` Vineeth Pillai [this message]
2020-08-24 20:31   ` Dhaval Giani
2020-08-24 19:50 ` Dhaval Giani
2020-08-24 22:12   ` Joel Fernandes
2020-08-24 20:53 ` chris hyser
2020-08-24 21:42   ` chris hyser
