From: Dhaval Giani <dhaval.giani@gmail.com>
To: Vineeth Pillai <vineethrp@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>,
	Nishanth Aravamudan <naravamudan@digitalocean.com>,
	JulienDesfossez@google.com,
	Julien Desfossez <jdesfossez@digitalocean.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	mingo@kernel.org, Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, LKML <linux-kernel@vger.kernel.org>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Kees Cook <keescook@chromium.org>, Phil Auld <pauld@redhat.com>,
	Aaron Lu <aaron.lwe@gmail.com>,
	Aubrey Li <aubrey.intel@gmail.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Joel Fernandes <joelaf@google.com>, Chen Yu <yu.c.chen@intel.com>,
	Christian Brauner <christian.brauner@ubuntu.com>,
	chris hyser <chris.hyser@oracle.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	joshdon@google.com, xii@google.com, haoluo@google.com,
	Benjamin Segall <bsegall@google.com>
Subject: Re: [RFC] Design proposal for upstream core-scheduling interface
Date: Mon, 24 Aug 2020 13:31:42 -0700
Message-ID: <CAPhKKr8V68SGMPkMqqQE+j1dM-7MBD8XPxf1t6s-gUzwoY_BsQ@mail.gmail.com>
In-Reply-To: <CAOBnfPjG9=XPCYOP6Hau5rZaKAEb4rYHG=5oORJt36X1_nFPOg@mail.gmail.com>

On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai <vineethrp@gmail.com> wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of the interface proposal using cgroup v2
> controllers, and based on our offline discussion, I would like to note down
> an idea for a new pseudo filesystem interface for core scheduling. We could
> also include this in the API discussion during the core scheduling MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> ------------------------------------------------------------
>
> The basic requirement of core scheduling is simple: we need to group a set
> of tasks into a trust group that can share a core. So we don’t really need
> a nested hierarchy for the trust groups. Cgroup v2 follows a unified nested
> hierarchy model, which causes considerable confusion if the trusted tasks
> are at different levels of the hierarchy and we need to allow them to share
> a core. Cgroup v2's single hierarchy model makes it difficult to regroup
> tasks at different levels of nesting for core scheduling. As noted in this
> mail, we could use a multi-file approach and other interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is a new pseudo
> filesystem, “coreschedfs”. It is basically a flat filesystem with a maximum
> nesting level of 1. That means the root directory can have sub-directories
> for sub-groups, but those sub-directories cannot have further
> sub-directories representing trust groups. The root directory represents
> the system-wide trust group and the sub-directories represent trusted
> groups. Each directory, including the root directory, has the following set
> of files/directories:
>
> - cookie_id: User-exposed id for a cookie. This can be compared to a file
>              descriptor. It could be used in a programmatic API to
>              join/leave a group.
>
> - properties: An interface to specify how child tasks of this group should
>               behave. Can be used for specifying future flag requirements
>               as well. The current list of properties includes:
>               NEW_COOKIE_FOR_CHILD: every fork() by a task in this group
>                                     creates a new trust group
>               SAME_COOKIE_FOR_CHILD: every fork() by a task in this group
>                                      puts the child in this same group
>               ROOT_COOKIE_FOR_CHILD: every fork() by a task in this group
>                                      puts the child in the root group
>
> - tasks: Lists the tasks in this group. Main interface for adding and
>          removing tasks in a group.
>
> - <pid>: A directory per task that is a member of this trust group.
> - <pid>/properties: Same as the parent properties file, but overrides the
>                     group setting for this task.
>
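
To make the fork() semantics of the properties file concrete, here is a
minimal user-space model in Python. It is only a sketch under assumptions:
the names CoreSchedModel, Group, ChildPolicy, move(), fork() and exit() are
hypothetical, the default of SAME_COOKIE_FOR_CHILD and the rule that a newly
created group inherits the policy that created it are not stated in the
proposal, and nothing here corresponds to an existing kernel interface.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count


class ChildPolicy(Enum):
    NEW_COOKIE_FOR_CHILD = "new"
    SAME_COOKIE_FOR_CHILD = "same"
    ROOT_COOKIE_FOR_CHILD = "root"


@dataclass
class Group:
    cookie_id: int
    # Assumption: the proposal does not name a default policy.
    properties: ChildPolicy = ChildPolicy.SAME_COOKIE_FOR_CHILD
    tasks: set = field(default_factory=set)
    # Models <pid>/properties: per-task override of the group setting.
    overrides: dict = field(default_factory=dict)


class CoreSchedModel:
    """Flat set of trust groups: one root group plus sub-groups."""

    def __init__(self):
        self._next_id = count(1)
        self.root = Group(cookie_id=0)
        self.groups = {0: self.root}

    def new_group(self, properties=ChildPolicy.SAME_COOKIE_FOR_CHILD):
        group = Group(cookie_id=next(self._next_id), properties=properties)
        self.groups[group.cookie_id] = group
        return group

    def group_of(self, pid):
        for group in self.groups.values():
            if pid in group.tasks:
                return group
        raise KeyError(f"pid {pid} is not tracked")

    def move(self, pid, group):
        # A task is a member of exactly one group at a time.
        for other in self.groups.values():
            other.tasks.discard(pid)
            other.overrides.pop(pid, None)
        group.tasks.add(pid)

    def fork(self, parent_pid, child_pid):
        """Model of the fork() hook: place the child per the policy."""
        parent = self.group_of(parent_pid)
        policy = parent.overrides.get(parent_pid, parent.properties)
        if policy is ChildPolicy.NEW_COOKIE_FOR_CHILD:
            # Assumption: the new group inherits the policy.
            target = self.new_group(properties=policy)
        elif policy is ChildPolicy.ROOT_COOKIE_FOR_CHILD:
            target = self.root
        else:
            target = parent
        target.tasks.add(child_pid)
        return target

    def exit(self, pid):
        """Model of the exit() hook: drop the task from its group."""
        group = self.group_of(pid)
        group.tasks.discard(pid)
        group.overrides.pop(pid, None)
```

In this model, setting overrides[pid] on a group plays the role of writing
to <pid>/properties: it changes where that one task's children land without
touching the group-wide setting.
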
> This pseudo filesystem can be mounted anywhere in the root filesystem. I
> propose the default to be “/sys/kernel/coresched”.
>
> When coresched is enabled, the kernel internally creates the framework for
> this filesystem. The filesystem gets mounted at the default location and
> the admin can change this if needed. All tasks are in the root group by
> default. The admin or programs can then create trusted groups on top of
> this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem’s view of tasks is up to date with the system. APIs that
> manipulate core scheduling trusted groups should also make sure that the
> filesystem's view is updated.
>
> Note: The above idea is very similar to cgroup v1. Since there is no
> unified hierarchy in cgroup v1, most of the features of coreschedfs could
> be implemented as a cgroup v1 controller. As no new v1 controllers are
> allowed, I feel the best alternative for a simple API is to come up with a
> new filesystem, coreschedfs.
>
> The advantages of this approach are:
>
> - Detached from the cgroup unified hierarchy, so the very simple
>   requirement of core scheduling can be easily realized.
> - Admin can have fine-grained control of groups using shell and scripting.
> - Programmatic access is possible using existing APIs like mkdir, rmdir,
>   write and read. Alternatively, new APIs using the cookie_id could wrap
>   the above Linux APIs or use a new system call for core scheduling.
> - Fine-grained permission control using Linux filesystem permissions and
>   ACLs.
>
> Disadvantages are:
> - Yet another pseudo filesystem.
> - Very similar to cgroup v1, and might be re-implementing features that
>   are already provided by cgroup v1.
>
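
The "existing APIs like mkdir, rmdir, write, read" point can be illustrated
with a small Python sketch. Since coreschedfs does not exist, the sketch
runs against a temporary directory standing in for the mount point. The
file names (cookie_id, properties, tasks) come from the proposal; the
helper names and the one-pid-per-line format of the tasks file are
assumptions made for illustration. On a real pseudo filesystem the kernel
would presumably populate the files on mkdir and a plain rmdir would be
enough; here the helpers create and delete the stand-in files themselves.

```python
import os
import tempfile

GROUP_FILES = ("cookie_id", "properties", "tasks")


def create_group(mount, name):
    """mkdir a trust group directory and create empty stand-in files."""
    path = os.path.join(mount, name)
    os.mkdir(path)
    for fname in GROUP_FILES:
        open(os.path.join(path, fname), "a").close()
    return path


def add_task(group_path, pid):
    """Add a task by appending its pid to the group's tasks file."""
    with open(os.path.join(group_path, "tasks"), "a") as f:
        f.write(f"{pid}\n")


def list_tasks(group_path):
    """Read back the pids listed in the group's tasks file."""
    with open(os.path.join(group_path, "tasks")) as f:
        return [int(line) for line in f if line.strip()]


def remove_group(group_path):
    """Remove the stand-in files, then rmdir the group directory."""
    for fname in GROUP_FILES:
        os.remove(os.path.join(group_path, fname))
    os.rmdir(group_path)


if __name__ == "__main__":
    # Temporary directory standing in for the coreschedfs mount point.
    with tempfile.TemporaryDirectory() as mount:
        vm1 = create_group(mount, "vm1")
        add_task(vm1, 1234)
        add_task(vm1, 1235)
        print(list_tasks(vm1))
        remove_group(vm1)
```

Because everything goes through ordinary file operations, the same steps
could be done from a shell script, and access could be restricted with
normal directory permissions.
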
> Use Cases
> -----------------
>
> Use case 1: Google cloud
> ------------------------
>
> Since we no longer depend on cgroup v2 hierarchies, there will not be any
> issue of nesting and sharing. The main daemon can create trusted groups in
> the filesystem and provide the required permissions for each group. The
> init processes for each job can then be added to their respective groups
> so that they can create child tasks as needed. Multiple jobs under the
> same customer that need to share a core can be housed in one group.
>
>
> Use case 2: Chrome browser
> --------------------------
>
> We start with one group for the first task and then set its properties to
> NEW_COOKIE_FOR_CHILD.
>
> Use case 3: Chrome VMs
> ----------------------
>
> Similar to the Chrome browser, the VM task can put each vcpu in its own
> group.
>
> Use case 4: Oracle use case
> ---------------------------
> This is also similar to use case 1 with this interface. All tasks that need to
> be in the root group can be easily added by the admin.
>
> Use case 5: General virtualization
> ----------------------------------
>
> The requirement is that each VM should be isolated. This can easily be
> done by creating a new group per VM.
>
>
> Please have a look at the above proposal and let us know your thoughts. We
> shall also include this in the interface discussion at the core scheduling
> MC.
>

I am inclined to say no to this. Yet another FS interface :-(. We are
just reinventing the wheel here. Let's try to stick within cgroupfs
first and see if we can make it work there.

Dhaval
