From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753353AbbHRUbW (ORCPT <rfc822;w@1wt.eu>);
	Tue, 18 Aug 2015 16:31:22 -0400
Received: from mail-pa0-f42.google.com ([209.85.220.42]:34145 "EHLO
	mail-pa0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751464AbbHRUbU (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 18 Aug 2015 16:31:20 -0400
Date: Tue, 18 Aug 2015 13:31:17 -0700
From: Tejun Heo <tj@kernel.org>
To: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>,
        Johannes Weiner <hannes@cmpxchg.org>, lizefan@huawei.com,
        cgroups <cgroups@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>,
        kernel-team <kernel-team@fb.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 3/3] sched: Implement interface for cgroup unified
 hierarchy
Message-ID: <20150818203117.GC15739@mtj.duckdns.org>
References: <1438641689-14655-1-git-send-email-tj@kernel.org>
 <1438641689-14655-4-git-send-email-tj@kernel.org>
 <20150804090711.GL25159@twins.programming.kicks-ass.net>
 <20150804151017.GD17598@mtj.duckdns.org>
 <20150805091036.GT25159@twins.programming.kicks-ass.net>
 <20150805143132.GK17598@mtj.duckdns.org>
 <CAPM31RJTf0v=2v90kN6-HM9xUGab_k++upO0Ym=irmfO9+BbFw@mail.gmail.com>
 <CAPM31R+5=4bGQo++PrYoFuS86_7JqhhQ0OtPvCYooAzJsvhb=w@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CAPM31R+5=4bGQo++PrYoFuS86_7JqhhQ0OtPvCYooAzJsvhb=w@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello, Paul.

On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote:
> > 2) Control within an address-space.  For subsystems with fungible resources,
> > e.g. CPU, it can be useful for an address space to partition its own
> > threads.  Losing the capability to do this against the CPU controller would
> > be a large set-back for instance.  Occasionally, it is useful to share these
> > groupings between address spaces when processes are cooperative, but this is
> > less of a requirement.
> >
> > This is important to us.

Sure, let's build a proper interface for that.  Do you actually need a
sub-hierarchy inside a process?  Can you describe your use case in
detail and explain why hierarchical CPU cycle distribution is
essential for it?

> >> And that's one of the major fuck ups on cgroup's part that must be
> >> rectified.  Look at the interface being proposed there.  It's exposing
> >> direct hardware details w/o much abstraction which is fine for a
> >> system management interface but at the same time it's intended to be
> >> exposed to individual applications.
> >
> > FWIW this is something we've had no significant problems managing with
> > separate mounts and file system protections.  Yes, there are some
> > potential warts around atomicity; but we've not found them too onerous.

You guys control the whole stack.  Of course, you can get away with an
interface which is pretty messed up in terms of layering and
isolation; however, a generic kernel interface cannot be designed
according to that standard.

> > What I don't quite follow here is the assumption that CAT would
> > necessarily be exposed to individual applications.  What's wrong with
> > subsystems that are primarily intended only for system management agents?
> > We already have several of these.

Why would you assume that threads of a process wouldn't want to
configure it ever?  How is this different from CPU affinity?

> >> This lack of distinction makes
> >> people skip the attention that they should be paying when they're
> >> designing interface exposed to individual programs.  Worse, this makes
> >> these things fly under the review scrutiny that public API accessible
> >> to applications usually receives.  Yet, that's what these things end
> >> up to be.  This just has to stop.  cgroups can't continue to be this
> >> ghetto shortcut to implementing half-assed APIs.
> >
> > I certainly don't disagree on this point :).  But as above, I don't quite
> > follow why an API being in cgroups must mean it's accessible to an
> > application controlled by that group.  This has certainly not been a
> > requirement for our use.

I don't follow what you're trying to say with the above paragraph.
Are you still talking about CAT?  If so, that use case isn't the only
one.  I'm pretty sure there are people who would want to configure
cache allocation at thread level.

> >> What we should be doing is pushing them into the same arena as any
> >> other publicly accessible API.  I don't think there can be a shortcut
> >> to this.
> >
> > Are you explicitly opposed to non-hierarchical partitions, however?  Cpuset
> > is [typically] an example of this, where the interface wants to control
> > unified properties across a set of processes.  Without necessarily being
> > usefully hierarchical.  (This is just to understand your core position, I'm
> > not proposing cpuset should shape *anything*.)

I'm having trouble following what you're trying to say.  FWIW, cpuset
is fully hierarchical.

> >> I don't think we want migration in sub-process hierarchy but in the
> >> off chance we do the naming can follow the same pid/program
> >> group/session id scheme, which, again, is a lot easier to deal with
> >> from applications.
> >
> > I don't have many objections with hand-off versus migration above, however,
> > I think that this is a big drawback.  Threads are expensive to create and
> > are often cached rather than released.  While migration may be expensive,
> > creating a thread is more so.  The ability to reconfigure a thread's
> > personality at run-time is important.

The core problem here is picking the hot path.  If cgroups as a whole
doesn't pick a position here, controllers have to assume that
migration might not be a very cold path which naturally leads to
overall designs and synchronization schemes which concede hot path
performance to accommodate migration.  We simply can't afford to do
that - we end up losing way more in way hotter paths for something
which may be marginally useful in some corner cases.

So, this is a trade-off we're consciously making.  If there are
common-enough use cases which require jumping across different cgroup
domains, we'll try to figure out a way to accommodate those but by
default migration is a very cold and expensive path.
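
As a rough illustration of that trade-off, here is a toy Python sketch.
It is not how any cgroup controller is implemented; Controller,
charge() and migrate_all() are hypothetical names.  Each worker's hot
path takes only its own, normally uncontended lock, and migration pays
for that by taking every lock.

```python
import threading


class Controller:
    """Toy model: cheap per-worker hot path, stop-everything migration."""

    def __init__(self, nr_workers, group):
        self.locks = [threading.Lock() for _ in range(nr_workers)]
        self.group = [group] * nr_workers             # worker -> group name
        self.usage = [{} for _ in range(nr_workers)]  # per-worker counters

    def charge(self, worker, amount):
        # Hot path: touches only this worker's lock and counters.
        with self.locks[worker]:
            g = self.group[worker]
            self.usage[worker][g] = self.usage[worker].get(g, 0) + amount

    def migrate_all(self, dest):
        # Cold path: take every lock (in index order) so that all workers
        # switch groups together.
        for lock in self.locks:
            lock.acquire()
        try:
            self.group = [dest] * len(self.locks)
        finally:
            for lock in self.locks:
                lock.release()

    def total(self, group):
        return sum(u.get(group, 0) for u in self.usage)


ctl = Controller(nr_workers=4, group="A")


def work(i):
    for _ in range(1000):
        ctl.charge(i, 1)


threads = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
ctl.migrate_all("B")
for t in threads:
    t.join()

# No charge is lost, whichever group it landed in.
print(ctl.total("A") + ctl.total("B"))
```

The cost of migrate_all() grows with the number of workers, which is
acceptable only as long as migration stays rare.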

> >> But those are relative to the current directory per operation and
> >> there's no way to define a transaction across multiple file
> >> operations.  There's no way to prevent a process from being migrated
> >> in between openat() and a subsequent write().
> >
> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another
> > way to address some of these issues.

That sounds horrible to me.  What if the process wants to do RMW on a
config?  What if the permissions are different after an intervening
migration?  What if the sub-hierarchy no longer exists or has been
replaced by a hierarchy with the same topology which is actually a
different one?
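
To make the RMW case concrete, here is a toy Python model.  It is not
kernel code; ForwardingAccessor, the "weight" knob and the group names
are all hypothetical.  The only assumption is that such an accessor
re-resolves the caller's current cgroup on every operation.

```python
class ForwardingAccessor:
    """Toy stand-in for a path that resolves to the caller's current cgroup."""

    def __init__(self, cgroups, current):
        self.cgroups = cgroups  # group name -> {knob: value}
        self.current = current  # group the task is in right now

    def read(self, knob):
        return self.cgroups[self.current][knob]

    def write(self, knob, value):
        self.cgroups[self.current][knob] = value

    def migrate(self, dest):
        # Performed by an external agent; the task is not told.
        self.current = dest


def double_weight(acc, between=None):
    """Read-modify-write, with an optional hook run between the two steps."""
    old = acc.read("weight")
    if between is not None:
        between()
    acc.write("weight", old * 2)


cgroups = {"A": {"weight": 100}, "B": {"weight": 500}}
acc = ForwardingAccessor(cgroups, "A")

# The task is migrated from A to B between its read and its write.
double_weight(acc, between=lambda: acc.migrate("B"))
print(cgroups)  # {'A': {'weight': 100}, 'B': {'weight': 200}}
```

The value read from A ends up in B, A is never updated, and nothing in
the two calls lets the task notice.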

> > I don't quite agree here.  Losing per-thread control within the cpu
> > controller is likely going to mean that much of it ends up being
> > reimplemented as some duplicate-in-appearance interface that gets us back to
> > where we are today.  I recognize that these controllers (cpu, cpuacct) are
> > square pegs in that per-process makes sense for most other sub-systems; but
> > unfortunately, their needs and use-cases are real / dependent on their
> > present form.

Let's build an API which actually looks and behaves like an API which
is properly isolated from what external agents may do to the process.
I can't see how that would be "back to where we are today".  All of
those are pretty critical attributes for a public kernel API and
utterly broken right now.

Thanks.

-- 
tejun