linux-kernel.vger.kernel.org archive mirror
* CRAK: a process checkpoint/restart kernel module
@ 2001-05-21  9:57 Hua Zhong
  2001-05-23 16:26 ` Pavel Machek
  0 siblings, 1 reply; 5+ messages in thread
From: Hua Zhong @ 2001-05-21  9:57 UTC (permalink / raw)
  To: linux-kernel


This project has been around for over a year, and I've received quite a few
emails asking about it.  Even though it is not fully reliable yet, I think
letting more people know about it is a good idea.  Thanks to those who
pushed me on it :-)

I guess many of you already know about epckpt, a patch written
by Eduardo Pinheiro that adds process checkpoint/restart capability to the
Linux kernel.  CRAK does a similar thing - in fact, I started this
project based on epckpt's code, but by now the two have become very
different.

The major differences are:

* CRAK is a kernel module (!!)
* CRAK doesn't do any bookkeeping (thus no run time overhead)
* CRAK uses a different strategy to checkpoint parallel processes (user
space vs kernel space, and signal vs semaphore)

Moreover, I've successfully (in the sense of working for simple cases such
as telnet) added network socket support.  For academic reasons I have not
put this portion of the code online yet, but I'll do so as soon as
possible.

The main website is at http://www.cs.columbia.edu/~huaz/research/crak.htm.
It works for 2.2.19 and 2.4.4 (the latter is still beta).  You can also
learn more about checkpointing at http://www.checkpointing.org (maintained
by Eduardo Pinheiro).

Speaking of reliability, it's not 100% reliable.  Originally I wanted to
make it more reliable before announcing it, but I have since realized (and
been convinced) that letting people know about it earlier could make this
goal happen sooner.

All comments/praise/criticism are welcome.  Thanks.

----------------------------------------------------------------
Hua Zhong

Central Research Facilities	Department of Computer Science
Columbia University		New York, NY 10027
Email: huaz@cs.columbia.edu	http://www.cs.columbia.edu/~huaz
----------------------------------------------------------------





* Re: CRAK: a process checkpoint/restart kernel module
  2001-05-21  9:57 CRAK: a process checkpoint/restart kernel module Hua Zhong
@ 2001-05-23 16:26 ` Pavel Machek
  2001-05-25 18:51   ` Hua Zhong
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Machek @ 2001-05-23 16:26 UTC (permalink / raw)
  To: Hua Zhong; +Cc: linux-kernel

Hi!!

> This project has been there for over one year, and I've got quite a few
> emails asking about it.  Before it becomes more reliable, I think letting
> more people know about it is a good idea.  Thanks to those who ever
> pushed me on it :-)
> 
> I guess many of you have already known about epckpt, a patch written
> by Eduardo Pinheiro that adds process checkpoint/restart capability to the
> Linux kernel.  CRAK does the similar thing - in fact, I started this
> project based on epckpt's code, but now they have been very different.

One question: can crak be used for process migration (assuming nodes
share filesystem)? [As in, a node of the cluster is going down, so we
checkpoint and resume on some other node?]

								Pavel
PS: Can it checkpoint/restart X applications? I guess some games would
be easier with ability to checkpoint ;-)
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



* Re: CRAK: a process checkpoint/restart kernel module
  2001-05-23 16:26 ` Pavel Machek
@ 2001-05-25 18:51   ` Hua Zhong
  2001-05-25 21:41     ` Pavel Machek
  0 siblings, 1 reply; 5+ messages in thread
From: Hua Zhong @ 2001-05-25 18:51 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Hua Zhong, linux-kernel


Please cc me - I am currently off the list.

On Wed, 23 May 2001, Pavel Machek wrote:

> Hi!!
>
> One question: can crak be used for process migration (assuming nodes
> share filesystem)? [As in, node of
> cluster is going down so we checkpoint and resume on some other node?]

Yes, as long as the resources (opened files) can be accessed on both
nodes.

> PS: Can it checkpoint/restart X applications? I guess some games would
> be easier with ability to checkpoint ;-)

Which means we need to support migrating network sockets.  I added
TCP/IPv4 socket support this spring (currently for 2.2.19; I will port it
to 2.4 shortly), and I tested migrating X applications.  In certain cases I
successfully migrated applications like Emacs, Acroread, etc., but
there is a problem.  (The socket migration code has not been put online
yet, but I'd like to discuss how it works here.)

Basically I took three steps to migrate a TCP socket.  Assuming A and B
are the two peers:

1. shut down process A while keeping B open
2. restart A and re-establish the socket which points to B
3. change the socket on B to point to the new location of A

The problem is that, during this stage, if B sends packets to A before
step 3 is complete, B's socket will get a RST.  In the case of X, if you
click or move the cursor on A's window while A is being migrated, it will
crash.
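
To make the race concrete, here is a toy user-space model of the three
steps.  It is only an illustrative sketch, not CRAK's code: the Network
class, the migrate() helper and the "hostA:6000" style addresses are all
made up for the example, and real TCP state is far richer than a
dictionary lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """Toy network: maps an address to the process listening there."""
    listeners: dict = field(default_factory=dict)

    def send(self, dst_addr):
        # A segment sent to an address with no matching socket gets a RST.
        return "delivered" if dst_addr in self.listeners else "RST"


def migrate(net, peer_b, old_addr, new_addr, b_sends_early=False):
    """Walk through the three steps; peer_b holds B's view of the socket."""
    events = []
    del net.listeners[old_addr]        # step 1: shut down A, B stays open
    net.listeners[new_addr] = "A"      # step 2: restart A at its new location
    if b_sends_early:
        # B still points at the old address because step 3 has not run yet.
        result = net.send(peer_b["remote"])
        events.append(result)
        if result == "RST":
            peer_b["alive"] = False    # the reset kills B's socket
            return events
    peer_b["remote"] = new_addr        # step 3: repoint B's socket
    events.append(net.send(peer_b["remote"]))
    return events
```

With b_sends_early=False the only event is a delivery to the new
location; with b_sends_early=True the model reports a RST and B's socket
is marked dead before step 3 can run.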

One solution might be to freeze B while A is being migrated.  There are
two ways to freeze B:

1) send a SIGSTOP to B and later SIGCONT it.  It's simple to do but
freezes the whole process, which is bad in certain cases (e.g.,
the whole X server is stopped - the screen freezes).

2) freeze the socket only.  I tried to set the window size of B's socket to
zero, but it didn't work (I didn't try too hard though).  I'd like to know
whether there is a way to do so.
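
Option 1) can be tried from user space with ordinary signals.  The sketch
below is an assumption-laden illustration, not part of CRAK: freeze_during()
and demo() are hypothetical helpers, and the sleeping child process merely
stands in for peer B.

```python
import os
import signal
import subprocess
import sys


def freeze_during(pid, action):
    """Stop process `pid`, run `action()`, then always resume the process."""
    os.kill(pid, signal.SIGSTOP)
    try:
        return action()
    finally:
        os.kill(pid, signal.SIGCONT)


def demo():
    # Stand-in for peer B: a child process that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        def migrate_a():
            # Placeholder for migrating A.  Here we only confirm that the
            # child is reported as stopped while the action runs.
            _, status = os.waitpid(child.pid, os.WUNTRACED)
            return os.WIFSTOPPED(status)

        return freeze_during(child.pid, migrate_a)
    finally:
        child.terminate()
        child.wait()
```

The try/finally makes sure B is resumed even if the migration step fails,
since a peer left in the stopped state would look like a hang.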

Unfortunately, even if we use 1), it still doesn't solve the whole problem.
For example, when the X connection is tunneled through ssh, you can only
freeze the sshd process, but packets are still sent to it when you click
on the server side, which will crash the connection as well (at least with
my current implementation).  One reason might be that I didn't take care of
pending packets when I migrate a socket, but in fact, the real problem of
socket migration is that you don't know what will happen if the network
address changes.  Applications may depend on it (such as FTP).  A
virtual network interface should be provided to solve the problem
gracefully.
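
FTP is a good illustration of the address dependency because its active
mode PORT command carries the client's IPv4 address and port inside the
application payload, as six comma-separated decimal numbers.  The helpers
below are a small sketch with made-up names (ftp_port_command,
parse_port_command) and a documentation-range example address; they only
show that the address is baked into the data, so it goes stale if the
process moves to another host.

```python
def ftp_port_command(ip, port):
    """Build an FTP PORT command: h1,h2,h3,h4,p1,p2 with port = p1*256 + p2."""
    octets = ip.split(".")
    if len(octets) != 4 or not 0 <= port <= 65535:
        raise ValueError("need a dotted IPv4 address and a 16-bit port")
    p1, p2 = divmod(port, 256)
    return "PORT " + ",".join(octets + [str(p1), str(p2)])


def parse_port_command(command):
    """Recover (ip, port) from a PORT command built by ftp_port_command()."""
    fields = command.split(" ", 1)[1].split(",")
    ip = ".".join(fields[:4])
    port = int(fields[4]) * 256 + int(fields[5])
    return ip, port


def is_stale(command, current_ip):
    """True if the address announced in the command no longer matches."""
    return parse_port_command(command)[0] != current_ip
```

For example, a command built on 192.0.2.10 is reported as stale once the
process is running on 192.0.2.99, even though the TCP socket itself may
have been re-established correctly.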

As for migrating games, hmmm, here are my 2 cents:

1) Most online games use UDP, and CRAK hasn't implemented UDP support.
It's much easier than TCP though.
2) I am not sure what the effect would be if we changed the network
address.  Most games require you to join a group before you start, and
maybe the group membership is based on network address.

Finally, there is a lot of work left to do to make process migration
truly reliable, and CRAK is still far from that.  For example, what if an
application depends on its pid?  What if a process uses temporary files
(/tmp) which are not present on other nodes?  Or what if an application
deletes files that are still open (evil programs like make)?  Not all of
these can be handled, or at least not without enough kernel cooperation.
That is particularly hard when CRAK is just a kernel module.
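
The deleted-but-still-open case is easy to reproduce from user space.
This is only a sketch with hypothetical helper names (is_unlinked, demo);
it shows why a checkpoint that records just the path name cannot reopen
such a file on restart, since the data is reachable only through the
descriptor.

```python
import os
import tempfile


def is_unlinked(fd):
    """True if the open file behind `fd` has no directory entry left."""
    return os.fstat(fd).st_nlink == 0


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"still here")
        before = is_unlinked(fd)      # file still has a name
        os.unlink(path)               # delete it while it is open
        after = is_unlinked(fd)       # link count dropped to zero
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 100)       # contents remain readable via fd
        return before, after, data, os.path.exists(path)
    finally:
        os.close(fd)
```

A checkpointer could use a test like is_unlinked() to detect such
descriptors and either refuse to checkpoint or save the file contents
along with the process image; which of the two is right is a design
choice this sketch does not settle.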

I am still a learner.  I wrote CRAK mostly for fun, but I'd like to
hear some advice from the kernel hacker community if people think it has
some value.

> --
> Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
> details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
>

----------------------------------------------------------------
Hua Zhong

Central Research Facilities	Department of Computer Science
Columbia University		New York, NY 10027
Email: huaz@cs.columbia.edu	http://www.cs.columbia.edu/~huaz
----------------------------------------------------------------




* Re: CRAK: a process checkpoint/restart kernel module
  2001-05-25 18:51   ` Hua Zhong
@ 2001-05-25 21:41     ` Pavel Machek
  2001-05-26 12:20       ` Hua Zhong
  0 siblings, 1 reply; 5+ messages in thread
From: Pavel Machek @ 2001-05-25 21:41 UTC (permalink / raw)
  To: Hua Zhong; +Cc: linux-kernel

Hi!

> Please cc to me - I am currently off the list.

Ok.

> > One question: can crak be used for process migration (assuming nodes
> > share filesystem)? (As in, node of
> > cluster is going down so we checkpoint and resume on some other node?)
> 
> Yes, as long as the resources (opened files) can be accessed on both
> nodes.

Good.

> > PS: Can it checkpoint/restart X applications? I guess some games would
> > be easier with ability to checkpoint ;-)
> 
> Which means we need to support migrating network sockets. I added
> TCP/IPv4 socket support this spring (currently for 2.2.19 and will port to 
> 2.4 shortly), and I tested migrating X. In certain cases I
> successfully migrated some applications like Emacs, Acroread, etc, but
> there is a problem. (The socket migration code has not been put online,
> but I'd like to discuss how it works here)
> 
> Basically I took three steps to migrate a TCP socket. Assuming A and B
> are the two peers:
> 
> 1. shutdown process A while keep B open
> 2. restart A and re-establish the socket which points to B
> 3. change the socket on B to point to the new location of A

This assumes both A and B are on the same machine, right?

> The problem is, during this stage, if B sends packets to A before 3 is
> complete, B's socket will get a RST. In the case of X, if you click or
> move cursor on A's window when A is being migrated, it will crash.

<EVIL SOLUTION>
You might shut down the machine's networking between checkpoint and
restart. That way, packets are silently lost, and no RST is
generated.
</EVIL>

> One solution might be that freezing B when A is being migrated. There are
> two ways to freeze B:
> 
> 1) send a SIGSTOP to B and later SIGCONT it. It's simple to do but would
> result in freezing the whole process, which is bad in certain cases (e.g.,
> the whole X server is stopped - the screen freezes).

Assuming they are on the same machine.

> 2) freeze the socket only. I tried to set window sizes of B's socket to
> zero, but it didn't work (I didn't try too hard though). I'd like to know
> whether there is a way to do so.

You don't want to decrease window size, you want all packets silently
discarded.

> Unfortunately, even we use 1), it still doesn't solve the whole problem.
> For example, when the X connection is tunneled through ssh, you can only
> freeze the sshd process, but packets are still sent to it when you click
> on the server side, which will crash the connection as well (at least for
> my current implementation). One reason might be I didn't take care of
> pending packets when I migrate a socket, but in fact, the real problem of
> socket migration is that you don't know what would happen if the network
> address is changed. Applications may depend on it (such as FTP). A
> virtual network interface should be provided to solve the problem
> gracefully.
> 
> As of migrating games, hmmm, here are my 2cents:
> 
> 1) Most online games use UDP, and CRAK hasn't implemented UDP support.
> It's much easier than TCP though.

I guess you can't checkpoint/restart when there's a remote machine
involved. I was not thinking of online games, I was thinking about
tuxracer (a game on localhost).

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org


* Re: CRAK: a process checkpoint/restart kernel module
  2001-05-25 21:41     ` Pavel Machek
@ 2001-05-26 12:20       ` Hua Zhong
  0 siblings, 0 replies; 5+ messages in thread
From: Hua Zhong @ 2001-05-26 12:20 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Hua Zhong, linux-kernel

On Fri, 25 May 2001, Pavel Machek wrote:

> > Basically I took three steps to migrate a TCP socket. Assuming A and B
> > are the two peers:
> >
> > 1. shutdown process A while keep B open
> > 2. restart A and re-establish the socket which points to B
> > 3. change the socket on B to point to the new location of A
>
> This assumes both A and B are on same machine, right?

No.  They can be on different machines.  That's why it's called
"migration" :-)

> > The problem is, during this stage, if B sends packets to A before 3 is
> > complete, B's socket will get a RST. In the case of X, if you click or
> > move cursor on A's window when A is being migrated, it will crash.
>
> <EVIL SOLUTION>
> You might shutdown machine's networking between checkpoint and
> restart. That way, packets are silently lost, and there's no RST to be
> generated.
> </EVIL>

That's what a virtual network interface could be used for.  Packets sent to
A can be queued or discarded, whatever, if we have control at the
interface level.  Actually a PhD student in my department has been
working on it, and CRAK is just part of the project.
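
As a rough illustration of the queueing idea (not the actual design of
that project, which is not described here), a virtual endpoint could hold
packets while the process is in transit and flush them once the new
location is known.  The VirtualEndpoint class and its method names are
invented for this sketch.

```python
from collections import deque


class VirtualEndpoint:
    """Toy stand-in for a virtual interface in front of process A."""

    def __init__(self, location):
        self.location = location
        self.migrating = False
        self.pending = deque()   # packets held while A is in transit
        self.delivered = []      # (location, packet) pairs handed to A

    def receive(self, packet):
        if self.migrating:
            # Queue instead of letting the peer see a RST.
            self.pending.append(packet)
        else:
            self.delivered.append((self.location, packet))

    def begin_migration(self):
        self.migrating = True

    def finish_migration(self, new_location):
        self.location = new_location
        self.migrating = False
        # Flush held packets, in arrival order, to the new location.
        while self.pending:
            self.delivered.append((self.location, self.pending.popleft()))
```

Discarding instead of queueing would just mean dropping the packet in
receive() while migrating is set and relying on TCP retransmission.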

> I guess you can't checkpoint/restart when there's remote machine
> involved. I was not thinking online games, I was thinking about
> tuxracer (game on localhost).

localhost is much easier, but the same problem still exists.

> --
> I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
> Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
>

----------------------------------------------------------------
Hua Zhong

Central Research Facilities	Department of Computer Science
Columbia University		New York, NY 10027
Email: huaz@cs.columbia.edu	http://www.cs.columbia.edu/~huaz
----------------------------------------------------------------


