From: Tejun Heo
To: Peter Zijlstra
Cc: lizefan@huawei.com, hannes@cmpxchg.org, mingo@redhat.com,
	pjt@google.com, luto@amacapital.net, efault@gmx.de,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, lvenanci@redhat.com, Linus Torvalds,
	Andrew Morton
Subject: Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Date: Sun, 12 Feb 2017 14:05:44 +0900
Message-ID: <20170212050544.GJ29323@mtj.duckdns.org>
In-Reply-To: <20170210175145.GJ6515@twins.programming.kicks-ass.net>
References: <20170202200632.13992-1-tj@kernel.org>
 <20170203202048.GD6515@twins.programming.kicks-ass.net>
 <20170203205955.GA9886@mtj.duckdns.org>
 <20170206124943.GJ6515@twins.programming.kicks-ass.net>
 <20170208230819.GD25826@htj.duckdns.org>
 <20170209102909.GC6515@twins.programming.kicks-ass.net>
 <20170210154508.GA16097@mtj.duckdns.org>
 <20170210175145.GJ6515@twins.programming.kicks-ass.net>

Hello,

On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> Sure, we're past that. This isn't about what memcg can or cannot do.
> Previous discussions established that controllers come in two shapes:
>
> - task based controllers; these are built on per-task properties and
>   groups are aggregates over sets of tasks.
> Since per definition inter-task competition is already defined on
>   individual tasks, it's fairly trivial to extend the same rules to
>   sets of tasks etc.
>
>   Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
>
> - system controllers; instead of building from tasks upwards, they
>   split what previously would be machine-wide / global state. For
>   these there is no natural competition rule vs tasks, and hence your
>   no-internal-task rule.
>
>   Examples: memcg, io, hugetlb

This is a bit of a delta, but as I wrote before, at least cpu (and
accordingly cpuacct) won't stay purely task-based, as we should account
resource consumption which isn't tied to specific tasks to the matching
domain (e.g. CPU consumption during writeback, disk encryption, or CPU
cycles spent receiving packets).

> > And here's another point, currently, all controllers are enabled
> > consecutively from root. If we have leaf thread subtrees, this
> > still works fine. Resource domain controllers won't be enabled into
> > thread subtrees. If we allow switching back and forth, what do we
> > do in the middle while we're in the thread part?
>
> From what I understand you cannot re-enable a controller once it's
> been disabled, right? If you disable it, it's dead for the entire
> subtree.

cgroups don't enable controllers by default on creation, and users can
enable and disable controllers dynamically as long as the conditions
are met. So, controllers can be disabled and re-enabled.

> > No matter what we do, it's gonna be more confusing and we lose
> > basic invariants like "parent always has superset of control knobs
> > that its child has".
>
> No, exactly that. I don't think I ever proposed something different.
>
> The "resource domain" flag I proposed violates the
> no-internal-processes thing, but it doesn't violate that rule afaict.
If we go to thread mode and back to domain mode, the control knobs for
domain controllers don't make sense on the thread part of the tree, and
they won't have a cgroup_subsys_state to correspond to either. For
example, in

  A - T - B

B's memcg knobs would control memory distribution from A, and cgroups
in T can't have memcg knobs. It'd be weird to indicate that memcg is
enabled in those cgroups too. We can make it work somehow; it's just a
weird-ass interface.

> > As for the runtime overhead, if you get affected by adding a
> > top-level cgroup in any measurable way, we need to fix that. That's
> > not a valid argument for messing up the interface.
>
> I think cgroup tree depth is a more significant issue; because of
> hierarchy we often do tree walks (up-to-root or down-to-task).
>
> So creating elaborate trees is something I try not to do.

So, as long as the depth stays reasonable (single digits or lower),
what we try to do is keep tree traversal operations aggregated or on
slow paths. There still are places where this overhead shows up (e.g.
the block controllers aren't particularly optimized), but it isn't
difficult to make a handful of layers not matter at all. memcg batches
its charging operations, and the overhead of several levels of
hierarchy there is impossible to measure. In general, I think it's
important to ensure this remains the case, so that users can lay out
cgroups to match the actual resource hierarchy rather than having to
twist the layout for optimization.

> > Even if we allow switching back and forth, we can't make the same
> > cgroup both resource domain && thread root. Not in a sane way at
> > least.
>
> The back and forth thing yes, but even with a single level, the one
> resource domain you tag will be both resource domain and thread root.

Ah, you're right.

Thanks.

--
tejun
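For reference, the dynamic enable/disable behavior discussed above is
driven through the cgroup2 `cgroup.subtree_control` file. A minimal
sketch against a mounted cgroup2 hierarchy (the mount point, cgroup
names, and controller set are illustrative; requires root):

```shell
# Assumes a cgroup2 hierarchy mounted at /sys/fs/cgroup (illustrative).
cd /sys/fs/cgroup
mkdir parent parent/child

# Controllers available here; nothing is enabled for children by
# default on creation.
cat parent/cgroup.controllers        # e.g. "cpu io memory pids"
cat parent/cgroup.subtree_control    # empty on a fresh cgroup

# Enable controllers for parent's children...
echo "+memory +pids" > parent/cgroup.subtree_control

# ...disable one again...
echo "-memory" > parent/cgroup.subtree_control

# ...and re-enable it, which succeeds as long as the constraints hold
# (notably the no-internal-process rule: a non-root cgroup with
# controllers enabled in subtree_control may not itself host processes).
echo "+memory" > parent/cgroup.subtree_control
```

The writes fail with EBUSY/EINVAL when a constraint is violated, which
is the "as long as the conditions are met" caveat above.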