linux-kernel.vger.kernel.org archive mirror
From: Dave Hansen <dave@linux.vnet.ibm.com>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Theodore Tso <tytso@mit.edu>,
	Daniel Lezcano <daniel.lezcano@fr.ibm.com>,
	Arnd Bergmann <arnd@arndb.de>,
	containers@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org,
	Peter Chubb <peterc@gelato.unsw.edu.au>
Subject: Re: checkpoint/restart ABI
Date: Tue, 12 Aug 2008 09:46:59 -0700
Message-ID: <1218559619.5598.97.camel@nimitz>
In-Reply-To: <48A1BB39.3090108@goop.org>

On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote:
> Inter-machine networking stuff is hard because it's outside the 
> checkpointed set, so the checkpoint is observable.  Migration is easier, 
> in principle, because you might be able to shift the connection endpoint 
> without bringing it down.  Dealing with networking within your 
> checkpointed set is just fiddly, particularly remembering and restoring 
> all the details of things like urgent messages, on-the-fly file 
> descriptors, packet boundaries, etc.

All true.  Hard stuff.

The IBM product works partly by restricting migrations to a single
physical Ethernet network.  Each container gets its own IP and MAC
address.  The socket state is checkpointed nearly in full and moved
along with the IP.

> > Unlinked files, for instance, are actually available in /proc.  You can
> > freeze the app, write a helper that opens /proc/1234/fd, then copies its
> > contents to a linked file (ooooh, with splice!)  Anyway, if we can do it
> > in userspace, we can surely do it in the kernel.
> 
> Sure, there's no inherent problem.  But do you imagine including the 
> file contents within your checkpoint image, or would they be saved 
> separately?

Me, personally, I think I'd probably "re-link" the thing, mark it as
such, ship it across like a normal file, then unlink it after the
restore.  I don't know what we'd choose when actually implementing it.  
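
A minimal userspace sketch of the helper idea quoted above, in Python
for brevity.  It is an illustration only, not the implementation under
discussion: the names relink_unlinked() and demo() are made up here, it
uses a plain copy rather than splice, and it assumes a Linux system with
/proc mounted.  It does not freeze the target, preserve the file offset,
or mark the copy as "re-linked".

```python
import os
import shutil
import tempfile


def relink_unlinked(fd: int, dest_path: str, pid: str = "self") -> int:
    """Copy an open (possibly unlinked) file to dest_path via /proc.

    Hypothetical helper: opening /proc/<pid>/fd/<fd> gives a fresh open
    of the same inode, so the read starts at offset 0 regardless of
    where the original descriptor is positioned.
    """
    proc_path = f"/proc/{pid}/fd/{fd}"
    with open(proc_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return os.path.getsize(dest_path)


def demo() -> bytes:
    """Create a file, unlink it while open, then rescue its contents."""
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "scratch")
    fd = os.open(victim, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"state the app still needs\n")
        os.unlink(victim)  # no name in the filesystem any more
        rescued = os.path.join(workdir, "scratch.relinked")
        relink_unlinked(fd, rescued)
        with open(rescued, "rb") as f:
            return f.read()
    finally:
        os.close(fd)
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(demo())
```

For another process, pid would be that task's PID (the 1234 in the
quoted example), and the caller needs permission to read its /proc
entries.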

> > I'm not sure what you mean by "closed files".  Either the app has a fd,
> > it doesn't, or it is in sys_open() somewhere.  We have to get the app
> > into a quiescent state before we can checkpoint, so we basically just
> > say that we won't checkpoint things that are *in* the kernel.
> 
> It's common for an app to write a tmp file, close it, and then open it a 
> bit later expecting to find the content it just wrote.  If you 
> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then 
> resume, then it will lose its tmp file.  There's no explicit connection 
> between the process and its potential working set of files.

I respectfully disagree.  The number one prerequisite for
checkpoint/restart is isolation.  Xen just happens to get this for free.
So, instead of saying that there's no explicit connection between the
process and its working set, ask yourself how we make a connection.

In this case, we can do it with a filesystem (mount) namespace.  Each
container that we might want to checkpoint must have its writable
filesystems confined to a private set that is not shared with other
containers.  Things like union mounts would help here, but aren't
strictly required.  They just make it more efficient.

>   We had to 
> deal with it by setting a bunch of policy files to tell the 
> checkpoint/restart system what filename patterns it had to look out 
> for.  But if you just checkpoint the whole filesystem state along with 
> the process(es), then perhaps it isn't an issue.

Right.  We just start with "everybody has their own disk", which is slow
and crappy, and optimize it from there.

> > Is there anything specific you are thinking of that particularly worries
> > you?  I could write pages on the list you have there.
> 
> No, that's the problem; it all worries me.  It's a big problem space.

It's almost as big a problem as trying to virtualize entire machines
and expecting them to run as fast as native. :)

> > I don't want to get into a full virtualization vs. containers debate,
> > but we also want it for all the same reasons that you migrate Xen
> > partitions.
> >
> No, I don't have any real opinion about containers vs virtualization.  I 
> think they're quite distinct solutions for distinct problems.
> 
> But I was involved in the design and implementation of a 
> checkpoint-restart system (along with Peter Chubb), and have the scars 
> to prove it.  We implemented it for IRIX; we called it Hibernator, and 
> licensed it to SGI for a while (I don't remember what name they marketed 
> it under).  The list of problems that Peter and I mentioned are ones we 
> had to solve (or, in some cases, failed to solve) to get a workable system.

Cool!  I didn't know you guys did the IRIX implementation.  I'm sure you
got a lot farther than any of us have.  Did you ever write any papers
or anything on it?  I'd be interested in more information.

-- Dave

