Date: Tue, 25 Aug 2015 17:02:34 -0400
From: Tejun Heo
To: Paul Turner
Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan@huawei.com,
    cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton
Subject: Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy
Message-ID: <20150825210234.GE26785@mtj.duckdns.org>
References: <20150818203117.GC15739@mtj.duckdns.org>
 <20150822182916.GE20768@mtj.duckdns.org>
 <20150824213600.GK28944@mtj.duckdns.org>
 <20150824221935.GN28944@mtj.duckdns.org>

Hello,

On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote:
> > This is an erratic behavior on cpuset's part tho.  Nothing else
> > behaves this way and it's borderline buggy.
>
> It's actually the only sane possible interaction here.
>
> If you don't overwrite the masks you can no longer manage cpusets with
> a multi-threaded application.
> If you partially overwrite the masks you can create a host of
> inconsistent behaviors where an application suddenly loses
> parallelism.

It's a layering problem.  It'd be fine if cpuset either layered
per-thread affinities below the cpuset mask with a config change
notification, or ignored / rejected per-thread affinities altogether.
What we have now is two layers manipulating the same field without any
mechanism for coordination.  (There's a rough sketch of what the
layering option could look like further down in this mail.)

> The *only* consistent way is to clobber all masks uniformly.  Then
> either arrange for some notification to the application to re-sync, or
> use sub-sub-containers within the cpuset hierarchy to advertise
> finer-partitions.

I don't get it.  How is that the only consistent way?  Why is making
irreversible changes even a good way?  Just layer the masks and trigger
a notification on changes.

> I don't think the case of having a large compute farm with
> "unimportant" and "important" work is particularly fringe.  Reducing
> the impact on the "important" work so that we can scavenge more cycles
> for the latency insensitive "unimportant" is very real.

What if optimizing cache allocation across competing threads of a
process can yield, say, a 3% gain across a large compute farm?  Is that
fringe?

> Right, but it's exactly because of _how bad_ those other mechanisms
> _are_ that cgroups was originally created.  Its growth was not
> managed well from there, but let's not step away from the fact that
> this interface was created to solve this problem.

Sure; at the same time, please don't forget that there are ample
reasons we can't replace more basic mechanisms with cgroups.  I'm not
saying this can't be part of cgroup, but rather that it's misguided to
plunge into cgroups as the first and only step.

More importantly, I am extremely doubtful that we understand the usage
scenarios and their benefits very well at this point, and I want to
avoid over-committing to something we'll look back on and regret.  As
it currently stands, this has a high likelihood of becoming another
case of mismanaged growth.
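Coming back to the layering option for a second, here's the sketch I
mentioned - minimal and purely illustrative.  The struct and the helper
names are made up for this mail, and it's written as plain userspace C
against the glibc CPU_* macros rather than as kernel code; it's only
meant to show the shape of the idea, not an implementation.

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/*
 * Illustration only: keep what the thread asked for and what its
 * cpuset currently allows as separate layers and apply their
 * intersection.  A cpuset config change then updates the allowed
 * layer without destroying the thread's own request.
 */
struct layered_affinity {
	cpu_set_t requested;	/* what the thread asked for */
	cpu_set_t allowed;	/* what its cpuset currently permits */
	cpu_set_t effective;	/* intersection actually applied */
};

/* recompute the effective mask; true means "notify the application" */
bool recompute_effective(struct layered_affinity *la)
{
	cpu_set_t next;

	CPU_AND(&next, &la->requested, &la->allowed);
	if (CPU_EQUAL(&next, &la->effective))
		return false;
	la->effective = next;
	return true;
}

/* cpuset change: replace the allowed layer, keep the request intact */
bool cpuset_mask_changed(struct layered_affinity *la, cpu_set_t *allowed)
{
	la->allowed = *allowed;
	return recompute_effective(la);
}

The thread's own request survives a cpuset reconfiguration, and the
application only needs to be poked when the effective set actually
changes.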
For the cache allocation thing, I'd strongly suggest something way
simpler and non-committal - e.g. create a char device node with simple
configuration and basic access control.  If this *really* turns out to
be useful and its configuration becomes complex enough to warrant
cgroup integration, let's do it then, and if we actually end up there,
I bet the interface we'd come up with at that point would be different
from what people are proposing now.

> > Yeah, I understand the similarity part but don't buy that the benefit
> > there is big enough to introduce a kernel API which is expected to be
> > used by individual programs which is radically different from how
> > processes / threads are organized and applications interact with the
> > kernel.
>
> Sorry, I don't quite follow, in what way is it radically different?
> What is magically different about a process versus a thread in this
> sub-division?

I meant cgroupfs, as opposed to most other programming interfaces that
we publish to applications.  We already have a process / thread
hierarchy which is created through forking / cloning, and conventions
built around it for interaction.  No sane application programming
interface requires individual applications to open a file somewhere,
echo some values into it and use directory operations to manage its
organization.  Will get back to this later.

> > All controllers only get what their ancestors can hand down to them.
> > That's basic hierarchical behavior.
>
> And many users want non work-conserving systems in which we can add
> and remove idle resources.  This means that how much bandwidth an
> ancestor has is not fixed in stone.

I'm having a hard time following you on this part of the discussion.
Can you give me an example?

> > If that's the case and we fail miserably at creating a reasonable
> > programming interface for that, we can always revive thread
> > granularity.  This is mostly a policy decision after all.
>
> These interfaces should be presented side-by-side.  This is not a
> reasonable patch-later part of the interface as we depend on it today.

Revival of thread affinity is trivial and will stay that way for a long
time, and the transition is already gradual, so it'll be a lost
opportunity but there is quite a bit of maneuvering room.

Anyways, on with the sub-process interface.  I'm skipping the
description of the problems with the current setup here as I've
repeated it a couple of times in this thread already.

On the other sub-thread, I said that the process / thread tree and
cgroup association are inherently tied.  This is because a new child
task is always born into its parent's cgroup, and the only reason
cgroup works for system-management use cases is that system management
often controls enough of how processes are created.

The flexible migration that cgroup supports may suggest that an
external agent with enough information can define and manage a
sub-process hierarchy without involving the target application, but
this doesn't necessarily work: such information is often only available
inside the application itself, and the internal thread hierarchy has to
be agreeable to the hierarchy being imposed upon it - when threads are
created dynamically, different parts of the hierarchy should be created
by different parent threads.
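Just to make the "born into the parent's cgroup" point concrete - this
is plain fork() behavior, nothing hypothetical about it; a child starts
out with exactly the membership its parent had at creation time:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* print which cgroups the calling task currently belongs to */
static void print_cgroup(const char *who)
{
	char line[256];
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return;
	printf("%s:\n", who);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	print_cgroup("parent");
	fflush(stdout);

	if (fork() == 0) {
		/* the child reports exactly the parent's membership */
		print_cgroup("child");
		_exit(0);
	}
	wait(NULL);
	return 0;
}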
Also, the problem with external and in-application manipulations
stepping on each other's toes is mostly not caused by individual config
settings but by the possibility that they may try to set up different
hierarchies, or modify the existing one in a way the other doesn't
expect.

Given that the thread hierarchy already needs to be compatible with the
resource hierarchy, is something unix programs already understand, and
thus lends itself to a far more conventional interface which doesn't
cause organizational conflicts, I think it's logical to use it for
sub-process resource distribution.

So, it comes down to something like the following:

  set_resource($TID, $FLAGS, $KEY, $VAL)

- If $TID isn't already a resource group leader, it creates a
  sub-cgroup, sets $KEY to $VAL and moves $TID and all its descendants
  into it.

- If $TID is already a resource group leader, it sets $KEY to $VAL.

- If the process is moved to another cgroup, the sub-hierarchy is
  preserved.

(I've appended a rough usage sketch at the end of this mail.)

The reality is a bit more complex: cgroup core would need to handle
implicit leaf cgroups and the duplication of sub-hierarchies.  The
biggest complexity would be extending atomic multi-thread migrations to
accommodate multiple targets, but cgroup already does atomic multi-task
migrations and performing the migrations back-to-back should work.
Controller-side changes wouldn't amount to much - copying configs when
cloning a sub-hierarchy and specifying which ones are available should
be about it.

This should give applications a simple and straightforward interface to
program against while avoiding all the issues with exposing cgroupfs
directly to individual applications.

> > So, the proposed patches already merge cpu and cpuacct, at least in
> > appearance.  Given the kitchen-sink nature of cpuset, I don't think it
> > makes sense to fuse it with cpu.
>
> Arguments in favor of this:
>  a) Today the load-balancer has _no_ understanding of group level
>     cpu-affinity masks.
>  b) With SCHED_NUMA, we can benefit from also being able to apply (b)
>     to understand which nodes are usable.

Controllers can cooperate with each other on the unified hierarchy -
cpu can simply query the matching cpuset css about the relevant
attributes, and the results will always be properly hierarchical for
cpu too.  There's no reason to merge the two controllers for that.
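To make the set_resource() idea a bit more concrete, here is the usage
sketch mentioned above.  Everything in it is hypothetical - the
prototype, the flags value, the "cpu.weight" key and its value are all
made up for illustration, and the stub only exists so the snippet is
self-contained:

#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* stand-in for the proposed call; imagine a syscall behind it */
static int set_resource(pid_t tid, unsigned int flags,
			const char *key, const char *val)
{
	(void)tid; (void)flags; (void)key; (void)val;
	return 0;
}

static void *background_worker(void *arg)
{
	(void)arg;

	/*
	 * First call on a thread which isn't a resource group leader:
	 * the kernel would create the implicit sub-group, apply the
	 * setting and move this thread (and its future descendants)
	 * into it.
	 */
	set_resource((pid_t)syscall(SYS_gettid), 0, "cpu.weight", "10");

	/* ... crunch low-priority background work ... */
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* latency-sensitive work keeps running in the parent group */
	pthread_create(&t, NULL, background_worker, NULL);
	pthread_join(t, NULL);
	return 0;
}

Thanks.

-- 
tejun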