Subject: Re: [Documentation] State of CPU controller in cgroup v2
To: Tejun Heo, Andy Lutomirski
References: <20160820155659.GA16906@mtj.duckdns.org>
 <20160829222048.GH28713@mtj.duckdns.org>
 <20160831173251.GY12660@htj.duckdns.org>
 <20160831210754.GZ12660@htj.duckdns.org>
 <20160903220526.GA20784@mtj.duckdns.org>
 <20160909225747.GA30105@mtj.duckdns.org>
Cc: Ingo Molnar, Mike Galbraith, "linux-kernel@vger.kernel.org",
 kernel-team@fb.com, "open list:CONTROL GROUP (CGROUP)", Andrew Morton,
 Paul Turner, Li Zefan, Linux API, Peter Zijlstra, Johannes Weiner,
 Linus Torvalds
From: "Austin S. Hemmelgarn"
Date: Mon, 12 Sep 2016 11:20:03 -0400
In-Reply-To: <20160909225747.GA30105@mtj.duckdns.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2016-09-09 18:57, Tejun Heo wrote:
> Hello, again.
>
> On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
>>> * It doesn't bring any practical benefits in terms of capability.
>>> Userland can trivially handle the system-root and namespace-roots in
>>> a symmetrical manner.
>>
>> Your idea of "trivially" doesn't match mine. You gave a use case in
>
> I suppose I wasn't clear enough. It is trivial in the sense that if
> the userland implements something which works for namespace-root, it
> would work the same in system-root without further modifications.
>
>> which userspace might take advantage of root being special. If
> I was emphasizing the cases where userspace would have to deal with
> the inherent differences, and, when they don't, they can behave
> exactly the same way.
>
>> userspace does that, then that userspace cannot be run in a container.
>> This could be a problem for real users. Sure, "don't do that" is a
>> *valid* answer, but it's not a very helpful answer.
>
> Great, now we agree that what's currently implemented is valid. I
> think you're still failing to recognize the inherent specialness of
> the system-root and how much unnecessary pain the removal of the
> exemption would cause at virtually no practical gain. I won't repeat
> the same backing points here.
>
>>> * It's an unnecessary inconvenience, especially for cases where the
>>> cgroup agent isn't in control of boot, for partial usage cases, or
>>> just for playing with it.
>>>
>>> You say that I'm ignoring the same use case for namespace-scope but
>>> namespace-roots don't have the same hybrid function for partial and
>>> uncontrolled systems, so it's not clear why there even NEEDS to be
>>> strict symmetry.
>>
>> I think their functions are much closer than you think they are. I
>> want a whole Linux distro to be able to run in a container. This
>> means that useful things people do in a distro or initramfs or
>> whatever should just work if containerized.
>
> There isn't much which is getting in the way of doing that. Again,
> something which follows the no-internal-tasks rule would behave the same
> no matter where it is. The system-root is different in that it is exempt
> from the rule and thus is more flexible, but that difference is serving
> the purpose of handling the inherent specialness of the system-root.
> AFAICS, it is the solution which causes the least amount of contortion
> and unnecessary inconvenience to userland.
>
>>> It's easy and understandable to get hangups on asymmetries or
>>> exemptions like this, but they also often are acceptable trade-offs.
>>> It's really frustrating to see you first getting hung up on "this must
>>> be wrong" and even after explanations repeating the same thing just in
>>> different ways.
>>>
>>> If there is something fundamentally wrong with it, sure, let's fix it,
>>> but what's actually broken?
>>
>> I'm not saying it's fundamentally wrong. I'm saying it's a design
>
> You were.
>
>> that has a big wart, and that wart is unfortunate, and after thinking
>> a bit, I'm starting to agree with PeterZ that this is problematic. It
>> also seems fixable: the constraint could be relaxed.
>
> You've been pushing for enforcing the restriction on the system-root
> too and now are jumping to the opposite end. It's really frustrating
> that this is such a whack-a-mole game where you throw ideas without
> really thinking through them and only concede the bare minimum when
> all other logical avenues are closed off. Here, again, you seem to be
> stating a strong opinion when you haven't fully thought about it or
> tried to understand the reasons behind it.
>
> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?
>
>>>>>> Also, here's an idea to maybe make PeterZ happier: relax the
>>>>>> restriction a bit per-controller. Currently (except for /), if you
>>>>>> have subtree control enabled you can't have any processes in the
>>>>>> cgroup. Could you change this so it only applies to certain
>>>>>> controllers? If the cpu controller is entirely happy to have
>>>>>> processes and cgroups as siblings, then maybe a cgroup with only cpu
>>>>>> subtree control enabled could allow processes to exist.
>>>>>
>>>>> The document lists several reasons for not doing this and also that
>>>>> there is no known real world use case for such configuration.
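For concreteness, the constraint being debated here can be sketched as a
toy model. This is plain Python with invented names, not kernel code; it
merely encodes the rule as described above (a non-root cgroup may not
both enable controllers in cgroup.subtree_control and hold processes of
its own, while the system-root is exempt):

```python
class Cgroup:
    """Toy model of a cgroup v2 node illustrating the no-internal-tasks
    rule. Not the kernel's implementation; all names are invented."""

    def __init__(self, parent=None):
        self.parent = parent          # None means this is the system-root
        self.procs = set()            # member processes
        self.subtree_control = set()  # controllers enabled for children

    def is_root(self):
        return self.parent is None

    def attach(self, pid):
        # Mirrors writing a PID into cgroup.procs: rejected when a
        # non-root cgroup already distributes resources to its children.
        if not self.is_root() and self.subtree_control:
            raise OSError("EBUSY: controllers enabled for children")
        self.procs.add(pid)

    def enable(self, controller):
        # Mirrors writing "+controller" into cgroup.subtree_control:
        # rejected when a non-root cgroup has member processes.
        if not self.is_root() and self.procs:
            raise OSError("EBUSY: cgroup has member processes")
        self.subtree_control.add(controller)


root = Cgroup()
root.attach(1)            # system-root exemption: procs + controllers OK
root.enable("cpu")

child = Cgroup(parent=root)
child.enable("cpu")       # fine: no member processes yet
try:
    child.attach(42)      # violates no-internal-tasks
except OSError as e:
    print(e)
```

Relaxing the rule per-controller, as suggested in the quoted proposal,
would amount to making the check in attach() conditional on which
controllers are actually enabled.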
>>>
>>> So, up until this point, we were talking about no-internal-tasks
>>> constraint.
>>
>> Isn't this the same thing? IIUC the constraint in question is that,
>> if a non-root cgroup has subtree control on, then it can't have
>> processes in it. This is the no-internal-tasks constraint, right?
>
> Yes, that is what the no-internal-tasks rule is, but I don't understand
> how that is the same thing as process granularity. Am I completely
> misunderstanding what you are trying to say here?
>
>> And I still think that, at least for cpu, nothing at all goes wrong if
>> you allow processes to exist in cgroups that have cpu set in
>> subtree-control.
>
> If you confine it to the cpu controller, ignore anonymous
> consumptions, the rather ugly mapping between nice and weight values
> and the fact that nobody could come up with a practical usefulness for
> such setup, yes. My point was never that the cpu controller can't do
> it but that we should find a better way of coordinating it with other
> controllers and exposing it to individual applications.

So, having a container where not everything in the container is split
further into subgroups is not a practically useful situation? Because
that's exactly what both systemd and every other cgroup management tool
expects to have work as things stand right now. The root cgroup within
a cgroup namespace has to function exactly like the system-root,
otherwise nothing can depend on the special cases for the system root,
because it might get run in a cgroup namespace and such assumptions
would then be invalid. This in turn means that no current distro can run
unmodified in a cgroup namespace under a v2 hierarchy, which is a Very
Bad Thing.

>
>> ----- begin talking about process granularity -----
> ...
>>> I do. It's a horrible userland API to expose to individual
>>> applications if the organization that a given application expects can
>>> be disturbed by system operations.
>>> Imagine how this would be
>>> documented - "if this operation races with a system operation, it may
>>> return -ENOENT. Repeating the path lookup might make the operation
>>> succeed again."
>>
>> It could be made to work without races, though, with minimal (or even
>> no) ABI change. The managed program could grab an fd pointing to its
>> cgroup. Then it would use openat, etc for all operations. As long as
>> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
>> we're fine.
>
> After a migration, the cgroup and its interface knobs are a different
> directory and files. Semantically, during migration, we aren't moving
> the directory or files, and it'd be bizarre to overlay the semantics
> you're describing on top of the existing cgroupfs. We would have to
> break away from the very basic vfs rules, such as an fd, once opened,
> always corresponding to the same file. The only thing openat(2) does
> is abstract away prefix handling, and that is only a small part of
> the problem.
>
> A more acceptable way could be implementing, say, a per-task filesystem
> which always appears at a fixed location and proxies the operations;
> however, even this wouldn't be able to handle issues stemming from the
> lack of actual atomicity. Think about two tasks accessing the same
> interface file. If they race against an outside agent migrating them
> one-by-one, they may or may not be accessing the same file. If they
> perform operations with side effects such as config changes, creation
> of sub-cgroups and migrations, what would be the end result?
>
> In addition, a per-task filesystem is a much worse interface to
> program against than a system-call based API, especially when the same
> API which is used to do the exact same operations on threads can be
> reused for resource groups.
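The openat() idea above leans on ordinary VFS semantics: an open
directory fd follows a rename, while a saved path does not. That much is
easy to demonstrate on a plain filesystem (a sketch using a scratch
directory standing in for a cgroup and a regular file standing in for an
interface knob; as the reply notes, a real cgroup migration is not a
rename, so this behavior does not automatically carry over to cgroupfs):

```python
import os
import tempfile

scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "a", "b"))

# A regular file standing in for an interface knob like cgroup.procs.
knob_path = os.path.join(scratch, "a", "b", "cgroup.procs")
with open(knob_path, "w") as f:
    f.write("1234\n")

# The managed program grabs an fd for its own group directory...
dirfd = os.open(os.path.join(scratch, "a", "b"),
                os.O_RDONLY | os.O_DIRECTORY)

# ...then an outside agent "migrates" it: mv scratch/a scratch/c
os.rename(os.path.join(scratch, "a"), os.path.join(scratch, "c"))

# Path-based lookup now fails with ENOENT...
try:
    open(knob_path)
    path_still_works = True
except FileNotFoundError:
    path_still_works = False

# ...but openat(2) relative to the held fd still reaches the same file,
# because the directory's identity survived the rename.
fd = os.open("cgroup.procs", os.O_RDONLY, dir_fd=dirfd)
data = os.read(fd, 64)
os.close(fd)
os.close(dirfd)
```

The fd keeps resolving because a rename preserves the directory's
identity; the objection in the reply is precisely that cgroup migration
creates a different directory, so no amount of openat() plumbing
recovers this guarantee there.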
>
>> Note that this pretty much has to work if cgroup namespaces are to
>> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
>> namespace has to remain valid at all times
>
> If I'm not mistaken, namespaces don't allow this type of dynamic
> migration.
>
>> Obviously this only works if the cgroup in question doesn't itself get
>> destroyed, but having an internal hierarchy is a bit nonsensical if
>> the application shares a cgroup with another application, so that
>> shouldn't be a problem in practice.
>>
>> In fact, ISTM that allowing applications to manage cgroup
>> sub-hierarchies has almost exactly the same set of constraints as
>> allowing namespaced cgroup managers to work. In a container, the
>> outer manager manages where the container lives and the container
>> manages its own hierarchy. Why can't fancy cgroup-aware applications
>> work exactly the same way?
>
> System agents and individual applications are different. This is the
> same argument that you brought up earlier in this thread where you
> said that userland can just set up namespaces for individual
> applications. In purely mathematical terms, they can be mapped to
> each other, but that grossly ignores the practical differences between
> them.
>
> Most applications should and want to keep their assumptions
> conservative, robust and portable, and not dependent on some crazy
> fragile and custom-built namespace setup that nobody in the stack is
> really responsible for. How many would ever program against something
> like that?
>
> A system agent has a large part of the system configuration under its
> control (it's the system agent after all) and thus is way more
> flexible in what assumptions it can dictate and depend on.
>
>>> Yeah, systemd has a delegation feature for cases like that, which we
>>> depend on too.
>>>
>>> As for your example, who performs the cgroup setup and configuration,
>>> the application itself or an external entity?
>>> If an external entity,
>>> how does it know which thread is what?
>>
>> In my case, it would be a little script that reads a config file that
>> knows all kinds of internal information about the application and its
>> threads.
>
> I see. One-of-a-kind custom setup. This is a completely valid usage;
> however, please also recognize that it's an extremely specific one
> which is niche by definition. If we're going to support
> in-application hierarchical resource control, I think it's very
> important to make sure that it's something which is easy to use and
> widely accessible so that any lay application can make use of it.
> I'll come back to this point later.
>
>>> And, as for rgroup not covering it, would extending rgroup to cover
>>> multi-process cases be enough or are there more fundamental issues?
>>
>> Maybe, as long as the configuration could actually be created -- IIUC
>> the current rgroup proposal requires that the hierarchy of groups
>> matches the hierarchy implied by clone(), which isn't going to happen
>> in my case.
>
> We can make that dynamic as long as the subtree is properly scoped;
> however, there is an important design decision to make here. If we
> open up full-on dynamic migrations to individual applications, we
> commit ourselves to supporting arbitrarily high-frequency migration
> operations, which we've never supported before and which will restrict
> what we can do in terms of optimizing hot paths over migration.
>
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications, and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and ad hoc trial-and-error is a good enough way
> to find a working solution. That wiggle room goes away once we
> officially open this up to individual applications.
>
> So, if we decide to open up dynamic assignment, we need to weigh what
> we gain in terms of capabilities against the reduction of
> implementation maneuvering room. I guess there can be a middle ground
> where, for example, only initial assignment is allowed.
>
> It is really difficult to understand your position without
> understanding where the requirements are coming from. Can you please
> elaborate more on the workload? Why is the specific configuration
> useful? What is it trying to achieve?
>
>> But, given that this fancy-cgroup-aware-multiprocess-application case
>> looks so much like a cgroup-using container, ISTM you could solve the
>> problem completely by just allowing tasks to be split out by users who
>> want to do it. (Obviously those users will get funny results if they
>> try to do this to memcg. "Don't do that" seems fine here.) I don't
>> expect the race condition issues you're worried about to happen in
>> practice. Certainly not in my case, since I control the entire
>> system.
>
> What people do now with cgroup inside an application is extremely
> limited. Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up with for another
> application. Everybody is in a "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> detrimental to making it widely available and useful.
>
> Accepting some measured restrictions and building a common ground for
> everyone can make in-application cgroup usage vastly more accessible
> and useful than it is now. Certain things would need to be done
> differently and maybe some scenarios won't be supported as well, but
> those are trade-offs that we'd need to weigh against what we gain.
> Another point is that, for very specific use cases where none of these
> generic concerns matter, keeping on using cgroup v1 is fine.
> The lack of common
> resource domains has never been an issue for those use cases anyway.
>
> Thanks.
>