From: Marcelo Tosatti <mtosatti@redhat.com>
To: Tejun Heo <tj@kernel.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>,
	linux-kernel@vger.kernel.org, vikas.shivappa@intel.com,
	x86@kernel.org, hpa@zytor.com, tglx@linutronix.de,
	mingo@kernel.org, peterz@infradead.org, matt.fleming@intel.com,
	will.auld@intel.com, glenn.p.williamson@intel.com,
	kanaka.d.juvva@intel.com
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management
Date: Tue, 4 Aug 2015 09:55:20 -0300
Message-ID: <20150804125520.GA31450@amt.cnet>
In-Reply-To: <20150803203250.GA31668@amt.cnet>

On Mon, Aug 03, 2015 at 05:32:50PM -0300, Marcelo Tosatti wrote:
> On Sun, Aug 02, 2015 at 12:23:25PM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Fri, Jul 31, 2015 at 12:12:18PM -0300, Marcelo Tosatti wrote:
> > > > I don't really think it makes sense to implement a fully hierarchical
> > > > cgroup solution when there isn't the basic affinity-adjusting
> > > > interface 
> > > 
> > > What is an "affinity-adjusting interface"? Can you give an example
> > > please?
> > 
> > Something similar to sched_setaffinity().  Just a syscall / prctl or
> > whatever programmable interface that sets a per-task attribute.
> 
> You really want to specify the cache configuration "at once": 
> having process-A exclusive access to 2MB of cache at all times,
> and process-B 4MB exclusive, means you can't have process-C use 4MB of 
> cache exclusively (consider an 8MB cache machine).

That's not true. It's fine to set up the

	task set <--> cache portion

mapping in pieces.

In fact, it's more natural because you don't necessarily know in advance
the entire cache allocation (think of "cp largefile /destination" with
sequential use-once behavior).
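
To illustrate (a sketch only, not the interface this patch set
proposes): with the msr driver you can already program a single CLOS's
L3 bitmask in isolation and come back for the others later. MSR numbers
are per the Intel SDM section 17 (IA32_L3_MASK_n = 0xC90 + CLOSid);
assumes root and CAT-capable hardware:

/* sketch: set the L3 cache bitmask for one CLOS via /dev/cpu/N/msr */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_L3_MASK_BASE 0xC90		/* IA32_L3_MASK_0 */

static int write_l3_cbm(int cpu, unsigned int closid, uint64_t cbm)
{
	char path[32];
	int fd, ret = -1;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* for the msr device, the file offset selects the MSR number */
	if (pwrite(fd, &cbm, sizeof(cbm), IA32_L3_MASK_BASE + closid) ==
	    sizeof(cbm))
		ret = 0;
	close(fd);
	return ret;
}

int main(void)
{
	/* give CLOS 1 cache ways 0-1 now; other classes can be
	 * configured later, in pieces, as their needs become known */
	return write_l3_cbm(0, 1, 0x3) ? 1 : 0;
}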

However, there is a use-case for sharing: in scenario 1 it might be
possible (and desirable) to share code between applications.
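
(With CAT this is just overlapping bitmasks: for example, CBM 0xF0 for
one class of service and 0x3C for another share ways 4-5 while keeping
the rest of each allocation private.)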

> > > > and it isn't clear whether fully hierarchical resource
> > > > distribution would be necessary, especially given that the granularity
> > > > of the target resource is very coarse.
> > > 
> > > As I see it, the benefit of the hierarchical structure to the CAT
> > > configuration is simply to organize sharing of cache ways in subtrees
> > > - two cgroups can share a given cache way only if they have a common
> > > parent. 
> > > 
> > > That is the only benefit. Vikas, please correct me if I'm wrong.
> > 
> > cgroups is not a superset of a programmable interface.  It has
> > distinctive disadvantages and is not, even with hierarchy support, a
> > substitute for a regular syscall-like interface.  I don't think it makes sense
> > to go full-on hierarchical cgroups when we don't have basic interface
> > which is likely to cover many use cases better.  A syscall-like
> > interface combined with a tool similar to taskset would cover a lot in
> > a more accessible way.
> 
> How are you going to specify sharing of portions of cache by two sets
> of tasks with a syscall interface?
> 
> > > > I can see that how cpuset would seem to invite this sort of usage but
> > > > cpuset itself is more of an arbitrary outgrowth (regardless of
> > > > history) in terms of resource control and most things controlled by
> > > > cpuset already have counterpart interfaces which are readily accessible
> > > > to the normal applications.
> > > 
> > > I can't parse that phrase (due to ignorance). Please educate.
> > 
> > Hmmm... consider CPU affinity.  cpuset definitely is useful for some
> > use cases as a management tool especially if the workloads are not
> > cooperative or delegated; however, it's no substitute for a proper
> > syscall interface and it'd be silly to try to replace that with
> > cpuset.
> > 
> > > > Given that what the feature allows is restricting usage rather than
> > > > granting anything exclusively, a programmable interface wouldn't need
> > > > to worry about complications around privileges
> > > 
> > > What complications about privileges do you refer to?
> > 
> > It's not granting exclusive access, so individual user applications
> > can be allowed to do whatever they want as long as the issuer has
> > enough privilege over the target task.
> 
> Privilege management with the cgroup system: to change cache allocation
> requires privilege over cgroups.
> 
> Privilege management with a system call interface: applications
> could be allowed to reserve up to a certain percentage of the cache.
> 
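
(To make the second model concrete - purely illustrative, the constants
are invented: the kernel would only need a popcount check against a
per-user quota before accepting an unprivileged reservation:)

/* sketch: allow an unprivileged CBM request only while it stays within
 * the caller's quota; L3_NUM_WAYS and max_pct are assumptions */
#include <stdbool.h>

#define L3_NUM_WAYS 20	/* e.g. a 20-way L3 */

static bool reservation_allowed(unsigned long cbm, int max_pct)
{
	return __builtin_popcountl(cbm) * 100 <= L3_NUM_WAYS * max_pct;
}
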
> > > > while being able to reap most of the benefits in a much easier way.
> > > > Am I missing something?
> > > 
> > > The interface does allow for exclusive cache usage by an application.
> > > Please read the Intel manual, section 17; it is very instructive.
> > 
> > For that, it'd have to require some CAP, but I think just having a
> > restrictive interface in the style of CPU or NUMA affinity would go a
> > long way.
> > 
> > > The use cases we have now are the following:
> > > 
> > > Scenario 1: Consider a system with 4 high performance applications
> > > running, one of which is a streaming application that manages a very
> > > large address space from which it reads and writes as it does its processing.
> > > As such, the application will use all the cache it can get but does
> > > not need much, if any, cache. So it spoils the cache for everyone for no
> > > gain on its own. In this case we'd like to constrain it to the
> > > smallest possible amount of cache while at the same time constraining
> > > the other 3 applications to stay out of this thrashed area of the
> > > cache.
> > 
> > A tool in the style of taskset should be enough for the above
> > scenario.
> > 
> > > Scenario 2: We have a numeric application that has been highly optimized
> > > to fit in the L2 cache (2MB, for example). We want to ensure that its
> > > cached data does not get flushed from the cache hierarchy while it is
> > > scheduled out. In this case we exclusively allocate enough L3 cache to
> > > hold all of the L2 cache.
> > >
> > > Scenario 3: Latency sensitive application executing in a shared
> > > environment, where memory to handle an event must be in L3 cache
> > > for latency requirements to be met.
> > 
> > Either isolate CPUs or run other stuff with affinity restricted.
> > 
> > cpuset-style allocation can be easier for things like this but that
> > should be an addition on top, not the one and only interface.  How is
> > it going to handle the case where multiple threads of a process want
> > to restrict cache usage to avoid stepping on each other's toes?  Delegate the
> > subdirectory and let the process itself open it and write to files to
> > configure when there isn't even a way to atomically access the
> > process's own directory or a way to synchronize against migration?
> 
> One would preconfigure that in advance - but you are right, a 
> syscall interface is more flexible in that respect.

So, systemd is responsible for locking.

> > cgroups may be an okay management interface but a horrible
> > programmable interface.
> > 
> > Sure, if this turns out to be as important as cpu or numa affinity and
> > gets widely used creating management burden in many use cases, we sure
> > can add cgroups controller for it but that's a remote possibility at
> > this point and the current attempt is over-engineering solution for
> > problems which haven't been shown to exist.  Let's please first
> > implement something simple and easy to use.
> > 
> > Thanks.
> > 
> > -- 
> > tejun

Don't see an easy way to fix the sharing use-case (it would require
exposing the "intersection" between two task sets).

Can't a "cacheset" helper (similar to taskset) talk to systemd
to achieve the flexibility you point out?
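
Something like this, say (hypothetical sketch - PR_SET_CACHE_MASK does
not exist, it stands in for whatever syscall/prctl plus systemd
negotiation we would end up with):

/* cacheset: taskset-style launcher built on a hypothetical prctl */
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

#define PR_SET_CACHE_MASK 0x59440001	/* invented request number */

int main(int argc, char **argv)
{
	unsigned long cbm;

	if (argc < 3) {
		fprintf(stderr, "usage: cacheset <hex way mask> <cmd> [args...]\n");
		return 1;
	}
	cbm = strtoul(argv[1], NULL, 16);
	/* reserve the requested cache ways for this task, then exec */
	if (prctl(PR_SET_CACHE_MASK, cbm, 0, 0, 0)) {
		perror("prctl(PR_SET_CACHE_MASK)");	/* fails on real kernels */
		return 1;
	}
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}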

