From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S965786AbXCLMfd (ORCPT ); Mon, 12 Mar 2007 08:35:33 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S965790AbXCLMfd (ORCPT ); Mon, 12 Mar 2007 08:35:33 -0400
Received: from e1.ny.us.ibm.com ([32.97.182.141]:40267 "EHLO e1.ny.us.ibm.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S965786AbXCLMfc (ORCPT ); Mon, 12 Mar 2007 08:35:32 -0400
Date: Mon, 12 Mar 2007 18:12:26 +0530
From: Srivatsa Vaddagiri
To: menage@google.com, xemul@sw.ru, dev@sw.ru, pj@sgi.com, sam@vilain.net,
        ebiederm@xmission.com, winget@google.com, serue@us.ibm.com,
        akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, ckrm-tech@lists.sourceforge.net,
        containers@lists.osdl.org
Subject: Summary of resource management discussion
Message-ID: <20070312124226.GD17151@in.ibm.com>
Reply-To: vatsa@in.ibm.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.11
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

I happened to read the entire thread (@ http://lkml.org/lkml/2007/3/1/159)
all over again and felt it may be useful to summarize the discussions so
far.

If I have missed any important points or misrepresented someone's view
(unintentionally of course!), then I would be glad to be corrected.

1. Which task-grouping mechanism?

        [ This question is the most vital one that needs a consensus ]

Resource management generally works by applying resource controls over a
-group- of tasks (tasks of a user, tasks in a vserver/container etc).

What mechanism do we use to group tasks for res mgmt purposes?

Options:

a. Paul Menage's container(/uh-a-diff-name-pls?) patches

        The patches introduce a new pointer in task_struct,
        struct container_group *containers, and a new structure
        'struct container'.
        Tasks pointing to the same 'struct container' object (via their
        tsk->containers->container[] pointer) are considered to form a
        group associated with that container. The attributes associated
        with a container (ex: cpu_limit, rss_limit, cpus/mems_allowed)
        are decided by the options passed to the mount command (which
        binds one/more/all resource controllers to a hierarchy).

        + For workload management, where it is desirable to manage the
          resource consumption of a run-time defined (potentially
          arbitrary) group of tasks, this patch is handy, as no existing
          pointers in task_struct can be used to form such a run-time
          decided group.

        - (subjective!) If there is an existing grouping mechanism
          already (say tsk->nsproxy[->pid_ns]) over which res control
          needs to be applied, then the new grouping mechanism can be
          considered redundant (it can eat up unnecessary space in
          task_struct).

          What may help avoid this redundancy is to re-build the
          existing grouping mechanism (say tsk->nsproxy) using the
          container patches. Serge however expressed some doubts about
          such an implementation (for ex: how will one build
          hierarchical cpusets and non-hierarchical namespaces using
          that single 'grouping' pointer in task_struct) and also felt
          it may slow things down a bit from the namespaces pov (more
          dereferences reqd to get to a task's namespace).

b. Reuse existing pointers in task_struct, tsk->nsproxy or better perhaps
   tsk->nsproxy->pid_ns, as the means to group tasks (rcfs patches)

        This is based on the observation that the group of tasks whose
        resource consumption needs to be managed is already defined in
        the kernel by existing pointers (either tsk->nsproxy or
        tsk->nsproxy->pid_ns).

        + reuses existing grouping mechanism in kernel

        - mixes resource and name spaces (?)

c. Introduce yet-another new structure ('struct res_ctl?') which houses
   resource control (& possibly pid_ns?) parameters and a new pointer to
   this structure in task_struct (Herbert Poetzl).
        Tasks that have a pointer to the same 'struct res_ctl' are
        considered to form a group for res mgmt purposes.

        + Accessing res ctl information in the scheduler fast path is
          optimized (only two dereferences required)

        - If all resource control parameters (cpu, memory, io etc) are
          lumped together in the same structure, it is hard to have
          resource classes (cpu, mem etc) that are independent of each
          other.

        - If we introduce several pointers in task_struct to allow
          separation of resource classes, then it will increase storage
          space in task_struct and also fork time (we have to take a
          ref count on more than one object now). Herbert thinks this is
          a worthy tradeoff for the benefit gained in scheduler fast
          paths.

2. Where do we put resource control parameters for a group?

        This depends on 1. So the options are:

a. Paul Menage's patches:

        (tsk->containers->container[cpu_ctlr.subsys_id] - X)->cpu_limit

   An optimized version of the above is:

        (tsk->containers->subsys[cpu_ctlr.subsys_id] - X)->cpu_limit

b. rcfs

        tsk->nsproxy->ctlr_data[cpu_ctlr.subsys_id]->cpu_limit

c. Herbert's proposal

        tsk->res_ctl->cpu_limit

3. How are cpusets related to vserver/containers?

        Should it be possible to, let's say, create exclusive cpusets
        and attach containers to different cpusets?

4. Interface

        Filesystem vs system call

        Filesystem:

        + natural way to represent hierarchical data
        + File permission model convenient to delegate management of
          part of a tree to one user
        + Ease of use with scripts

        (from Herbert Poetzl):

        - performance of filesystem interfaces is quite bad
        - you need to do a lot to make the fs consistent for e.g.
          find and friends (regarding links and filesize)
        - you have quite a hard time doing atomic operations
          (except for the ioctl interface, which nobody likes)
        - vfs/mnt namespaces complicate the access to this new
          filesystem once you start moving around (between the spaces)

5. If we use filesystem interface, then should it be in /proc?
   (Eric)

        - /proc doesn't allow the flexibility of, say, creating
          multiple hierarchies and binding different resource
          controllers to each hierarchy

6. As tasks move around namespaces/resource-classes, their
   tsk->nsproxy/containers object will change. Do we simply create
   a new nsproxy/containers object or optimize storage by searching
   for one which matches the task's new requirements?

        - Linux Vserver follows the former approach, i.e. it simply
          creates a new nsproxy with pointers to the required
          namespace objects

7. Hierarchy

        - For res mgmt, do we need to worry about hierarchy at all?

        - If we consider cpuset to be a resource controller,
          then we have one resource controller that already
          supports hierarchy

        - If we don't support hierarchy in res controllers today
          but were to add that support later, then the user interface
          shouldn't change. That's why designing -at least- the user
          interface to support hierarchy may make sense

        - Do we let resource classes be split independently of each
          other?

          For ex: CPU resource classes are independent of memory
          resource classes. This in turn affects whether Paul Menage's
          patches need to support the multiple-hierarchy feature.

-- 
Regards,
vatsa
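
To make the three access paths listed under question 2 easier to compare,
here is a minimal user-space sketch. It is not code from any of the patch
sets: all structure layouts, MAX_SUBSYS, CPU_SUBSYS_ID, struct task and
the cpu_limit_*() helpers are hypothetical, and reading the "- X" in
option 2a as a container_of-style offset adjustment is my assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SUBSYS 4          /* hypothetical number of controllers */
enum { CPU_SUBSYS_ID = 0 };   /* hypothetical id of the cpu controller */

/* Generic per-controller state, embedded in the controller's own struct. */
struct subsys_state { int refcount; };

/* Hypothetical cpu controller state for one group of tasks. */
struct cpu_ctlr_state {
    long cpu_limit;
    struct subsys_state ss;   /* embedded, so "- X" can recover the parent */
};

/* Option 2a (optimized form): one array of per-subsystem pointers. */
struct container_group { struct subsys_state *subsys[MAX_SUBSYS]; };

/* Option 2b: controller data hangs off the nsproxy. */
struct nsproxy { void *ctlr_data[MAX_SUBSYS]; };

/* Option 2c: all resource parameters lumped into one structure. */
struct res_ctl { long cpu_limit; };

/* Stand-in for task_struct carrying all three candidate pointers. */
struct task {
    struct container_group *containers;
    struct nsproxy *nsproxy;
    struct res_ctl *res_ctl;
};

/* 2a: array lookup, then step back by the member offset (assumed "X"). */
long cpu_limit_a(const struct task *tsk)
{
    struct subsys_state *ss = tsk->containers->subsys[CPU_SUBSYS_ID];
    assert(ss != NULL);
    struct cpu_ctlr_state *cs = (struct cpu_ctlr_state *)
        ((char *)ss - offsetof(struct cpu_ctlr_state, ss));
    return cs->cpu_limit;
}

/* 2b: array lookup through nsproxy, no offset adjustment. */
long cpu_limit_b(const struct task *tsk)
{
    const struct cpu_ctlr_state *cs = tsk->nsproxy->ctlr_data[CPU_SUBSYS_ID];
    return cs->cpu_limit;
}

/* 2c: two dereferences, no array indexing. */
long cpu_limit_c(const struct task *tsk)
{
    return tsk->res_ctl->cpu_limit;
}
```

In this sketch options a and b can keep each controller's state in its own
object, while option c trades that independence for the shortest path,
which is the tradeoff described under 1c.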