From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <49C1435B.1090809@google.com>
Date: Wed, 18 Mar 2009 11:54:19 -0700
From: Mike Waychison
User-Agent: Thunderbird 2.0.0.19 (X11/20090105)
MIME-Version: 1.0
To: Oren Laadan
CC: Linus Torvalds, Andrew Morton, linux-api@vger.kernel.org,
 containers@lists.linux-foundation.org, hpa@zytor.com,
 linux-kernel@vger.kernel.org, Dave Hansen, linux-mm@kvack.org,
 viro@zeniv.linux.org.uk, mingo@elte.hu, mpm@selenic.com,
 tglx@linutronix.de, Sukadev Bhattiprolu, Alexey Dobriyan,
 xemul@openvz.org
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
References: <1234475483.30155.194.camel@nimitz>
 <20090212141014.2cd3d54d.akpm@linux-foundation.org>
 <1234479845.30155.220.camel@nimitz>
 <20090226155755.GA1456@x200.localdomain>
 <20090310215305.GA2078@x200.localdomain>
 <49B775B4.1040800@free.fr>
 <20090312145311.GC12390@us.ibm.com>
 <1236891719.32630.14.camel@bahia>
 <20090312212124.GA25019@us.ibm.com>
 <604427e00903122129y37ad791aq5fe7ef2552415da9@mail.gmail.com>
 <20090313053458.GA28833@us.ibm.com>
 <49BAC6AF.9090607@google.com>
 <49BADFCE.8020207@cs.columbia.edu>
In-Reply-To: <49BADFCE.8020207@cs.columbia.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-System-Of-Record: true
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Oren Laadan wrote:
>
> Mike Waychison wrote:
>> Linus Torvalds wrote:
>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>
>>>> Ying Han [yinghan@google.com] wrote:
>>>> | Hi Serge:
>>>> | I made a patch based on Oren's tree recently which implements a new
>>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>>> | and it works as expected.
>>>>
>>>> Yes, I think we had a version of clone() with pid a while ago.
>>> Are people _at_all_ thinking about security?
>>>
>>> Obviously not.
>>>
>>> There's no way we can do anything like this. Sure, it's trivial to do
>>> inside the kernel. But it also sounds like a _wonderful_ attack vector
>>> against badly written user-land software that sends signals and has small
>>> races.
>> I'm not really sure how this is different than a malicious app going off
>> and spawning thousands of threads in an attempt to hit a target pid from
>> a security pov. Sure, it makes it easier, but it's not like there is
>> anything in place to close the attack vector.
>>
>>> Quite frankly, from having followed the discussion(s) over the last few
>>> weeks about checkpoint/restart in various forms, my reaction to just about
>>> _all_ of this is that people pushing this are pretty damn borderline.
>>>
>>> I think you guys are working on all the wrong problems.
>>>
>>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>>> case process. Just TCP makes that fundamentally impossible in the general
>>> case, and there are lots and lots of other cases too (just something as
>>> totally _trivial_ as all the files in the filesystem that don't get rolled
>>> back).
>> In some instances such as ours, TCP is probably the easiest thing to
>> migrate. In an rpc-based cluster application, TCP is nothing more than
>> an RPC channel and applications already have to handle RPC channel
>> failure and re-establishment.
>>
>> I agree that this is not the 'general case' as you mention above,
>> however. This is the bit that sorta bothers me with the way the
>> implementation has been going so far on this list. The implementation
>> that folks are building on top of Oren's patchset tries to be everything
>> to everybody. For our purposes, we need to have the flexibility of
>> choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
>> the kernel being responsible for checkpointing and restoring all
>> resources associated with a task, and leaving userland with nothing more
>> than transporting filesystem bits. This approach isn't flexible enough:
>> Consider the case where we want to stub out most of the TCP file
>> descriptors with ECONNRESETed sockets because we know that they are RPC
>> sockets and can re-establish themselves, but we want to use some other
>> mechanism for TCP sockets we don't know much about. The current
>> monolithic approach has zero flexibility for doing anything like this,
>> and I can't figure out how we could even fit anything like this in.
>
> The flexibility exists, but wasn't spelled out, so here it is:
>
> 1) Similar to madvise(), I envision a cradvice() that could tell the c/r
> something about specific resources, e.g.:
>   * cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
>   * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
> etc .. (never mind the exact interface right now)
>
> 2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
> or a restart completes successfully. At that time they can do their private
> house-keeping if they know better.
>
> 3) If restoring some resource is significantly easier in user space (e.g. a
> file-descriptor of some special device which user space knows how to
> re-initialize), then the restarting task can prepare it ahead of time
> and call:
>   * cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
>     to restore it yourself.

This would be called by the embryo process (mktree.c?) before calling
sys_restart?

>
> Method #3 is what I used in Zap to implement distributed checkpoints, where
> it is so much easier to recreate all network connections in user space than
> putting that logic into the kernel.
>
> Now, on the other hand, doing the c/r from userland is much less flexible
> than in the kernel (e.g. epollfd, futex state and much more) and requires
> exposing a tremendous amount of in-kernel data to user space. And we all know
> that exposing internals is always a one-way ticket :(
>
> [...]
>
> Oren.
>
>
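
For illustration only, the cradvice() idea quoted above could be mocked in user space roughly as follows. No such syscall exists. Only the CR_ADV_* names come from the mail; the signature, the numeric values, the hint table, and the helpers cr_fd_policy() and cr_mem_is_scratch() are assumptions made up for this sketch. It only records hints and answers the question a restart path would ask: "who restores this resource?"

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical advice codes; the names follow the mail, the values are invented. */
enum cr_advice { CR_ADV_MEM = 1, CR_ADV_SOCK, CR_ADV_USERFD };
#define CR_ADV_SOCK_RESET 1UL

/* Hypothetical answer to "how should this fd come back on restart?" */
enum cr_fd_policy {
    CR_FD_KERNEL, /* default: the kernel restores it          */
    CR_FD_RESET,  /* hand back a socket that reports a reset  */
    CR_FD_USER    /* user space already prepared the fd       */
};

#define CR_MAX_HINTS 64

struct cr_hint {
    enum cr_advice advice;
    uintptr_t target;  /* address for CR_ADV_MEM, fd number otherwise */
    unsigned long arg; /* length for CR_ADV_MEM, flags otherwise      */
};

static struct cr_hint cr_hints[CR_MAX_HINTS];
static size_t cr_nhints;

/* Record one hint; returns 0 on success, -1 when the table is full. */
int cradvice(enum cr_advice advice, uintptr_t target, unsigned long arg)
{
    if (cr_nhints == CR_MAX_HINTS)
        return -1;
    cr_hints[cr_nhints++] = (struct cr_hint){ advice, target, arg };
    return 0;
}

/* Returns 1 if ptr lies inside a range marked as scratch, else 0. */
int cr_mem_is_scratch(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;

    for (size_t i = 0; i < cr_nhints; i++) {
        const struct cr_hint *h = &cr_hints[i];
        if (h->advice == CR_ADV_MEM && p >= h->target && p - h->target < h->arg)
            return 1;
    }
    return 0;
}

/* Look up the restore policy for fd; the most recent matching hint wins. */
enum cr_fd_policy cr_fd_policy(int fd)
{
    enum cr_fd_policy policy = CR_FD_KERNEL;

    for (size_t i = 0; i < cr_nhints; i++) {
        const struct cr_hint *h = &cr_hints[i];
        if (h->advice == CR_ADV_MEM || h->target != (uintptr_t)fd)
            continue;
        if (h->advice == CR_ADV_USERFD)
            policy = CR_FD_USER;
        else if (h->advice == CR_ADV_SOCK && h->arg == CR_ADV_SOCK_RESET)
            policy = CR_FD_RESET;
    }
    return policy;
}
```

In the RPC scenario described earlier in the thread, an application would call cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) on the sockets it knows how to re-establish and leave every other descriptor at the default, so cr_fd_policy() returns CR_FD_KERNEL for anything it was not told about.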