From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1760630AbaJ3RDh (ORCPT ); Thu, 30 Oct 2014 13:03:37 -0400
Received: from mail-qa0-f43.google.com ([209.85.216.43]:61731 "EHLO
    mail-qa0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1758594AbaJ3RDf (ORCPT );
    Thu, 30 Oct 2014 13:03:35 -0400
Date: Thu, 30 Oct 2014 13:03:31 -0400
From: Tejun Heo
To: Peter Zijlstra
Cc: Vikas Shivappa, "Auld, Will", Matt Fleming, Vikas Shivappa,
    "linux-kernel@vger.kernel.org", "Fleming, Matt"
Subject: Re: Cache Allocation Technology Design
Message-ID: <20141030170331.GB378@htj.dyndns.org>
References: <20141029081640.GT3337@twins.programming.kicks-ass.net>
    <20141029124834.GQ12020@console-pimps.org>
    <20141029134526.GC3337@twins.programming.kicks-ass.net>
    <96EC5A4F3149B74492D2D9B9B1602C27349EEB88@ORSMSX105.amr.corp.intel.com>
    <20141029172845.GP12706@worktop.programming.kicks-ass.net>
    <20141029182234.GA13393@mtj.dyndns.org>
    <20141030070725.GG3337@twins.programming.kicks-ass.net>
    <20141030124333.GA29540@htj.dyndns.org>
    <20141030131845.GI3337@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141030131845.GI3337@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hey, Peter.

On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote:
> > And that something shouldn't be disallowing task migration across
> > cgroups. This simply doesn't work with co-mounting or unified
> > hierarchy. cpuset automatically takes on the nearest ancestor's
> > configuration which has enough execution resources. Maybe that can be
> > an option for this too?
>
> It will give very random and nondeterministic behaviour and basically
> destroy the entire purpose of the controller (which are the very same
> reasons I detest that 'new' behaviour in cpusets).

I agree with you that this is a corner case behavior which deviates
from the usual behavior; however, the deviation is inherent. This
stems from the fact that the kernel in general doesn't allow tasks
which cannot be run. You say that you detest the new behaviors of
cpuset; however, the old behaviors were just as sucky - bouncing tasks
to an ancestor cgroup forcefully and without any indication or way to
restore the previous configuration. What's different with the new
behavior is that it explicitly distinguishes between the configured
and effective configurations as the kernel isn't capable of actually
enforcing a certain subset of configurations.

So, the inherent problem is always there no matter what we do and the
question is that of a policy to deal with it. One of the main issues
I see with failing cgroup-level operations for controller specific
reasons is lack of visibility. All you can get out of a failed
operation is a single error return and there's no good way to
communicate why something isn't working, or even who the culprit is.
Having "effective" vs "configured" makes it explicit that the kernel
isn't capable of honoring all configurations and makes the details of
the situation visible.

Another part is inconsistencies across controllers. This sure is
worse when there are multiple controllers involved but inconsistent
behaviors across different hierarchies are annoying all the same with
a single controller on multiple hierarchies. Userland often manages
some of those hierarchies together and it can get horribly confusing.
No matter what, we need to settle on a single policy and having
effective configuration seems like the better one.
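The "configured" vs "effective" split described above can be illustrated
with a small user-space model. This is only a sketch under stated
assumptions, not kernel code: the names Group, configured and effective
are hypothetical, and the fallback rule (use the nearest ancestor's
effective set when the group's own set has nothing runnable left) is a
simplified reading of the cpuset behavior discussed in this thread.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Group:
    """Hypothetical model of one cgroup's CPU configuration."""
    name: str
    configured: frozenset          # what the administrator asked for
    parent: Optional[Group] = None

    def effective(self, online: frozenset) -> frozenset:
        """Return the CPUs this group can actually use right now.

        The configured set is never modified; only the effective view
        changes when the online CPUs change.
        """
        # The root is limited by the online CPUs, a child by its parent.
        parent_eff = self.parent.effective(online) if self.parent else online
        own = self.configured & parent_eff
        # If nothing runnable is left, fall back to the nearest ancestor
        # that still has execution resources instead of failing.
        return own if own else parent_eff


root = Group("root", frozenset({0, 1, 2, 3}))
child = Group("child", frozenset({2, 3}), parent=root)

all_online = frozenset({0, 1, 2, 3})
half_online = frozenset({0, 1})        # CPUs 2 and 3 went offline

print(sorted(child.effective(all_online)))   # [2, 3]
print(sorted(child.effective(half_online)))  # [0, 1], inherited from root
print(sorted(child.configured))              # [2, 3], still visible
```

The point of the model is that the configured value survives the
hardware change, so the original setup comes back once CPUs 2 and 3 are
online again, and userland can compare the two values to see that the
configuration is not currently being honored.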
> > One of the problems is that we generally assume that a task can run
> > some point in time in a lot of places in the kernel and can't just not
> > run a task indefinitely because it's in a cgroup configured certain
> > way.
>
> Refusing tasks into a previously empty cgroup creates no such problems.
> Its already in a cgroup (wherever its parent was) and it can run there,
> failing to move it to another does not affect things.

Yeah, sure, hard failing can work too. It didn't work well for cpuset
because a runnable configuration may become not so if the system
config changes afterwards but this probably doesn't have an issue like
that.

I'm not saying something like the above won't work. It would, but I
don't think that's the right place to fail. This controller might not
even require the distinction between configured and effective tho?
Can't a new child just inherit the parent's configuration and never
allow the config to become completely empty? The problem cpuset faces
is that of underlying hardware configuration changing. This one
doesn't have that.

> > > So either we fail mkdir, but that means allocating CLOS IDs for possibly
> > > empty cgroups, or we allocate on demand which means failing task
> > > assignment.
> >
> > Can't fail mkdir or css enabling either. Again, co-mounting and
> > unified hierarchy. Also, the behavior is just horrible to use from
> > userland.
>
> In order to fix the co-mounting and unified hierarchy I still need to
> hear a proposal for that tasks vs processes thing.
>
> Traditionally the cgroups were task based, but many controllers are
> process based (simply because what they control is process wide, not per
> task), and there was talk (2-3 years ago or so) about making the entire
> cgroup thing per process, which obviously fails for all scheduler
> related cgroups.

Yeah, it needs to be a separate interface where a given userland task
can access its own knobs in a race-free way (cgroup interface can't
even do that) whether that's a pseudo filesystem, say,
/proc/self/BLAHBLAH or new syscalls. This one is necessary regardless
of what happens with cgroup. cgroup simply isn't a suitable mechanism
to expose these types of knobs to individual userland threads.

> > Yeah, RT is one of the main items which is problematic, more so
> > because it's currently coupled with the normal sched controller and
> > the default config doesn't have any RT slice.
>
> Simply because you cannot give a slice on creation; or if you did that
> would mean failing mkdir when a new cgroup would exceed the available
> time.
>
> Also any !0 slice is wrong because it will not match the requirements of
> the proposed workload, the administrator will have to set it to match
> the workload.
>
> Therefore 0.

As long as RT is separate from normal sched controller, this *could*
be fine. The main problem now is that userland which wants to use the
cpu controller but doesn't want to fully manage RT slices ends up
disabling RT slices. It might work if a new child can share the
parent's slice till explicitly configured. Another problem is when
you wanna change the configuration after the hierarchy is already
populated. I don't know. I'd even be happy with cgroup not having
anything to do with RT slice distribution. Do you have any ideas
which can make RT slice distribution more palatable? If we can't
decouple the two, we'd be effectively requiring whoever is managing
the cpu controller to also become a full-fledged RT slice arbitrator,
which might actually work too.

> > Do we completely block RT task w/o slice? Is that okay?
>
> We will not allow an RT task in, the write to the tasks file will fail.
>
> The same will be true for deadline tasks, we'll fail entry into a cgroup
> when the combined requirements of the tasks exceed the provisions of the
> group.
>
> There is just no way around that and still provide sane semantics.

Can't a task just lose RT / deadline properties when migrating into a
different RT / deadline domain? We already modify task properties on
migration for cpuset after all. It'd be far simpler that way.

Thanks.

--
tejun