From: Tejun Heo
To: Peter Zijlstra
Cc: Andy Lutomirski, Ingo Molnar, Mike Galbraith,
	"linux-kernel@vger.kernel.org", kernel-team@fb.com,
	"open list:CONTROL GROUP (CGROUP)", Andrew Morton, Paul Turner,
	Li Zefan, Linux API, Johannes Weiner, Linus Torvalds
Date: Tue, 4 Oct 2016 10:47:17 -0400
Subject: Re: [Documentation] State of CPU controller in cgroup v2
Message-ID: <20161004144717.GA4205@htj.duckdns.org>
In-Reply-To: <20160906102950.GK10153@twins.programming.kicks-ass.net>

Hello, Peter.

On Tue, Sep 06, 2016 at 12:29:50PM +0200, Peter Zijlstra wrote:
> The fundamental problem is that we have 2 different types of
> controllers, on the one hand these controllers above, that work on tasks
> and form groups of them and build up from that.  Lets call them
> task-controllers.
>
> On the other hand we have controllers like memcg which take the 'system'
> as a whole and shrink it down into smaller bits.  Lets call these
> system-controllers.
>
> They are fundamentally at odds with capabilities, simply because of the
> granularity they can work on.

As pointed out multiple times, the picture is not that simple.  For
example, eventually, we want to be able to account for cpu cycles
spent during memory reclaim or processing IOs (e.g. encryption),
which can only be tied to the resource domain, not to a specific
task.

There surely are things that can only be done by task-level
controllers, but there are two different aspects here.  One is the
actual capabilities (e.g. hierarchical proportional cpu cycle
distribution) and the other is how such capabilities are exposed.
I'll continue below.

> Merging the two into a common hierarchy is a useful concept for
> containerization, no argument on that, esp. when also coupled with
> namespaces and the like.

Great, we now agree that comprehensive system resource control is
useful.

> However, where I object _most_ strongly is having this one use dominate
> and destroy the capabilities (which are in use) of the task-controllers.

The objection isn't necessarily just about the loss of capabilities
but also about not being able to exercise them in the same way as in
v1.  The reason I proposed rgroup instead of scoped task granularity
is that I think a properly insulated programmable interface which is
in line with other widely used APIs is a better solution in the long
run.

If we go the cgroupfs route for thread granularity, we pretty much
lose the possibility, or at least make it very difficult, to make
hierarchical resource control widely available to individual
applications.  How important such use cases are is debatable.  I
don't find it too difficult to imagine scenarios where individual
applications like apache or torrent clients make use of it.

Probably more importantly, rgroup, or something like it, gives an
application an officially supported way to build and expose its own
resource hierarchy, which can then be used both by the application
itself and from the outside to monitor and manipulate resource
distribution.

The decision between cgroupfs thread granularity and something like
rgroup isn't an obvious one.  Choosing the former is the path of
lower resistance, but it comes at the cost of certain long-term
benefits.

> > It could be made to work without races, though, with minimal (or even
> > no) ABI change.  The managed program could grab an fd pointing to its
> > cgroup.  Then it would use openat, etc for all operations.  As long as
> > 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
> > we're fine.
>
> I've mentioned openat() and related APIs several times, but so far never
> got good reasons why that wouldn't work.

Hopefully, this part was addressed in my reply to Andy.
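For concreteness, here is a minimal sketch of the fd-relative pattern
quoted above (the mount point /sys/fs/cgroup, the group name "mygrp",
the "worker" sub-group and the use of cgroup.procs are illustrative
assumptions on my part, not details from this thread):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	/* grab an fd pointing at our own cgroup directory once */
	int cgrp = open("/sys/fs/cgroup/mygrp", O_RDONLY | O_DIRECTORY);
	if (cgrp < 0) {
		perror("open cgroup dir");
		return 1;
	}

	/*
	 * Everything else is done relative to that fd rather than
	 * through absolute paths; the point of the quoted example is
	 * that such accesses are meant to stay usable even if a
	 * manager renames or moves the group out from under us.
	 */
	if (mkdirat(cgrp, "worker", 0755) < 0 && errno != EEXIST) {
		perror("mkdirat worker");
		return 1;
	}

	int procs = openat(cgrp, "worker/cgroup.procs", O_WRONLY);
	if (procs < 0) {
		perror("openat cgroup.procs");
		return 1;
	}

	/* move ourselves into the sub-group */
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%d\n", getpid());
	if (write(procs, buf, len) != len)
		perror("write cgroup.procs");

	close(procs);
	close(cgrp);
	return 0;
}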
> cgroup-v2, by placing the system style controllers first and foremost,
> completely renders that scenario impossible.  Note also that any proposed
> rgroup would not work for this, since that, per design, is a subtree,
> and therefore not disjoint.

If a use case absolutely requires disjoint resource hierarchies, the
only solution is to keep using multiple v1 hierarchies, which
necessarily excludes the possibility of doing anything across
different resource types.

> So my objection to the whole cgroup-v2 model and implementation stems
> from the fact that it purports to be a 'better' and 'improved' system,
> while in actuality it neuters and destroys a lot of useful usecases.
>
> It completely disregards all task-controllers and labels their use-cases
> as irrelevant.

Your objection then doesn't have much to do with the specifics of the
cgroup v2 model or implementation.  It's an objection against
establishing common resource domains, as that excludes building
multiple orthogonal hierarchies.  The latter, necessarily, can only
be achieved by having separate hierarchies for different resource
types and thus giving up the benefits of common resource domains.

Assuming that, I don't think your position is against cgroup v2 but
more toward keeping v1 around.  We're talking about two quite
different, mutually exclusive classes of use cases: you need unified
for one and disjoint for the other.  v1 is gonna be there and can
easily be used alongside v2 for different controller types, which
would in most cases be cpu and cpuset.  I can't see a reason why this
would need to block properly supporting containerization use cases.

Thanks.

-- 
tejun