From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932095AbWAUKZ2 (ORCPT ); Sat, 21 Jan 2006 05:25:28 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S932101AbWAUKZ2 (ORCPT ); Sat, 21 Jan 2006 05:25:28 -0500 Received: from ebiederm.dsl.xmission.com ([166.70.28.69]:64212 "EHLO ebiederm.dsl.xmission.com") by vger.kernel.org with ESMTP id S932095AbWAUKZ2 (ORCPT ); Sat, 21 Jan 2006 05:25:28 -0500 To: Hubertus Franke Cc: Dave Hansen , Greg KH , Alan Cox , "Serge E. Hallyn" , Arjan van de Ven , linux-kernel@vger.kernel.org, Cedric Le Goater Subject: Re: RFC [patch 13/34] PID Virtualization Define new task_pid api References: <20060117143258.150807000@sergelap> <20060117143326.283450000@sergelap> <1137511972.3005.33.camel@laptopd505.fenrus.org> <20060117155600.GF20632@sergelap.austin.ibm.com> <1137513818.14135.23.camel@localhost.localdomain> <1137518714.5526.8.camel@localhost.localdomain> <20060118045518.GB7292@kroah.com> <1137601395.7850.9.camel@localhost.localdomain> <43D14578.6060801@watson.ibm.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 21 Jan 2006 03:25:03 -0700 In-Reply-To: <43D14578.6060801@watson.ibm.com> (Hubertus Franke's message of "Fri, 20 Jan 2006 15:18:00 -0500") Message-ID: User-Agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Hubertus Franke writes: > Eric W. Biederman wrote: > Let me just summarize the discussion that has taken place so far > and the consequences I/we seem to be drawing out of it. > > We discussed the various approaches that are floating around now, enough > has been said about each, so I leave it at that ... > (a) GLIBC intercept LD_PRELOAD > (b) Binary Rewrite of glibc > (c) syscall table intercept (see ZAP) > (d) vpid approach (see "IBM" patches posted) > (e) approach (see below, suggested by Alan?.. ) That seems to have been an observation of the current patchset. > > There are several issues that came up in the email exchange ( Arjen, Alan Cox, > .. ). > [ Please feel free to tell me if I/we captured or misinterpregin these wrong ] ... > Actions: The vpid_to_pid will disappear and the check for whether we are in the > same > container needs to be pushed down into the task lookup. question remains to > figure out > whether the context of the task lookup (will always remain the caller ?). You don't need a same container check. If something is in another container it becomes invisible to you. > Doing so has an implication, namely that we are moving over to "system > containers". > The current implementation requires the vpid/pid only for the boundary condition > at the > top of the container (to rewrite pid=1) and its parent and the fact that we > wanted > a global look through container=0. > If said boundary would be eliminated and we simply make a container a child of > the > initproc (pid=1), this would be unnecessary. > > all together this would provide private namespaces (as just suggested by Eric). > > The feeling would be that large parts of patch could be reduce by this. I concur. Except I think the initial impact could still be large. It may be worth breaking all users of pids just so we audit them. But that will certainly result in no long term cost, or runtime overhead. > What we need is a new system calls (similar to vserver) or maybe we can continue > the /proc approach for now... > > sys_exec_container(const *char container_name, pid_t pid, unsigned int flags, > const *char argv, const *char envp); > > exec_container creates a new container (if indicated in flags) and a new task in > it that reports to parent initproc. > if a non-zero pid is specified we use that pid, otherwise the system will > allocate it. Finally > it create new session id ; chroot and exec's the specified program. > > What we loose with this is the session and the tty, which Cedric described as > application > container... > > The sys_exec_container(...) seems to be similar to what Eric just called > clone_namespace() Similar. But I was actually talking about just adding another flag to sys_clone the syscall underlying fork(). Basically it is just another resource not share or not-share. Eric