From: Oren Laadan <orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
To: Sukadev Bhattiprolu
	<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: sqazi-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org,
	Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>,
	Pavel Emelyanov <xemul-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
Subject: Re: [LPC] Notes from Checkpoint/Restart BOF
Date: Mon, 12 Oct 2009 14:52:38 -0400
Message-ID: <4AD37AF6.8010903@librato.com>
In-Reply-To: <20090929001754.GA19933-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Hi,

Thanks for posting the notes. I placed a (modified) summary of the BOF
on the linux-c/r wiki:

	http://ckpt.wiki.kernel.org/index.php/LPC2009

Oren.

Sukadev Bhattiprolu wrote:
> 
> Notes from Checkpoint/Restart BOF at Linux Plumbers Conference, Sep 24, 2009.
> 
> (I am missing some details and a couple of names. They said they were on
> the Containers mailing list though. If you remember any other topics that
> we discussed or have any details, please add them to this mail).
> 
> ---
> 
> Attendees:
> 	Oren Laadan, Joeseph Ruscio, <One more person> (Librato)
> 	Pavel Emelyanov, <One more person ?> (OpenVZ)
> 	Ying Han, Salman Qazi (Google)
> 	Dan Smith, Matt Helsley, Sukadev Bhattiprolu (IBM)
> 
> 1. Pavel: A few months ago there were discussions about making a "dry-run"
>    to see if checkpoint of an application will succeed. What is the
>    current status of that ?
> 
> 	The answer was that there is no dry-run - the user should just try
> 	the actual C/R. If the application is using an uncheckpointable
> 	resource, the C/R will fail cleanly without side-effects.
> 	A dry-run may not mean anything unless we freeze the application
> 	during the check and leave it frozen until the checkpoint is done.
> 	IOW, a dry-run does not guarantee that the application is
> 	checkpointable unless the application is frozen.
> 
> 2. Pavel: Alexey Dobriyan had earlier submitted some code for leak-detection. Do
>    we still have that ?
> 
>    	The answer was that most of the code was used and we also added reverse
> 	detection.
> 
> 3. Do we have a config-option to make a process checkpointable ?
> 
> 	<Missed the context of this question> We have CONFIG_CHECKPOINT.
> 
> 4. Checkpointing network connections:
> 
> 	We quickly reviewed the status (AF_UNIX done, AF_INET done in a
> 	prototype and needs to be forward ported). Checkpoint of one-end
> 	of a network connection can cause the connection to be reset.
> 
> 5. Briefly discussed the distinction between live migration and static
>    migration.
> 
> 6. Do we need a pre-check during restart to ensure that the application can
>    be restarted ? Eg: if the application used a specific math co-processor
>    or futex at checkpoint and that resource is not available at restart,
>    the restart may encounter some undefined behavior. Should we encode the
>    hardware/OS capabilities in the checkpoint image and check these
>    capabilities during restart (before actual restart). Reason for this
>    check being the restart may not fail cleanly if the resource is missing.
> 
>    	Conclusion was that there could be too many such capabilities that
> 	we would have to track and even so there may be some unexpected
> 	difference between checkpoint machine and restart machine.
> 
> 	For now, let the restart fail and/or deal with it in user-space.
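
For illustration only, the pre-check proposed in item 6 can be modelled
in user space as a set comparison between capabilities recorded in the
image and capabilities found on the restart host. This is a minimal
sketch, not anything from the linux-c/r tree: the "required_caps" header
field, the capability names and both function names are assumptions made
up for the example.

```python
def missing_capabilities(image_caps, host_caps):
    """Return the capabilities the image needs but the host lacks, sorted."""
    return sorted(set(image_caps) - set(host_caps))


def precheck_restart(image_header, host_caps):
    """Refuse to restart up front if the host lacks a recorded capability.

    'image_header' is a hypothetical dict; 'required_caps' is an assumed
    field name, not part of any real checkpoint image format.
    """
    missing = missing_capabilities(image_header.get("required_caps", []),
                                   host_caps)
    if missing:
        raise RuntimeError("cannot restart, host lacks: " + ", ".join(missing))


# Example: the image recorded two capabilities, the host offers only one.
header = {"required_caps": ["fpu", "futex"]}
print(missing_capabilities(header["required_caps"], ["futex"]))
```

The sketch also shows the objection raised at the BOF: the check is only
as good as the list of capabilities somebody remembered to record.
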
> 
> 7. Briefly discussed clone2() aka clone_with_pids().
> 
> 	Everyone seemed to agree that restoring process-tree even in user-space
> 	will work and can be used.
> 
> 8. Oren: Error reporting during restart
> 
> 	We currently fail the system call with an error code and if we want
> 	more information on the failure, we have to add debug messages to
> 	the code. We discussed a couple of options for error reporting on restart:
> 		- log detailed message(s) to console (risk wrapping dmesg buf)
> 		- pass an extra-buffer to the system call and have kernel
> 		  fill-in more detailed error message (would need two new
> 		  parameters, one pointer to the buf, one size of the buf).
> 
> 		- Pass-in an extra 'log_fd' parameter to system call and have
> 		  kernel write detailed messages to that log_fd (unless log_fd
> 		  is -1). This seemed more flexible than the other two.
> 
> 		We agreed that the format of the log messages can be free-format
> 		and that there is no guarantee that the format of the log
> 		messages will not change.
> 
> 		But it was not clear (at least to me) if the log file should
> 		contain all log messages relating to the C/R or just the
> 		last (few) error messages.
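
To make the log_fd option from item 8 concrete, here is a small
user-space model of the calling convention: the call still returns a
plain error code, and detailed free-format text goes to log_fd unless it
is -1. This is only a sketch under stated assumptions. restart_model(),
the "CKPT" magic and the message wording are hypothetical and do not
describe the real restart system call.

```python
import errno
import os


def restart_model(image: bytes, log_fd: int = -1) -> int:
    """Hypothetical stand-in for a restart call that takes a log_fd.

    Returns 0 on success or a negative errno value on failure. Detailed
    messages are written to log_fd; log_fd == -1 disables logging.
    """
    def log(msg: str) -> None:
        if log_fd != -1:
            os.write(log_fd, (msg + "\n").encode())

    # "CKPT" is a made-up magic value used only for this illustration.
    if not image.startswith(b"CKPT"):
        log("restart: bad image header (expected magic 'CKPT')")
        return -errno.EINVAL

    log("restart: image header ok")
    return 0


# Usage: collect the detailed messages through a pipe.
r, w = os.pipe()
rc = restart_model(b"XXXX", log_fd=w)
os.close(w)
details = os.read(r, 4096).decode()
os.close(r)
print(rc, details, end="")
```

Because the caller owns the descriptor, it can point log_fd at a file, a
pipe or a socket, and it decides how much of the log to keep.
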
> 
> 9. Any application to summarize the checkpoint ?
> 
> 	We have a 'ckptinfo' that could summarize the contents of a checkpoint.
> 
> 10. Ying Han: Is there a performance difference between the original instance
>     of the application and the restarted instance ? (Eg: on NUMA if application
>     was on one node at checkpoint and after restart, ended up on another node).
> 
>     	Not sure if there was a conclusion to this point.
> 
> 11. Discussed that devices like tty, /dev/rtc etc must be virtualized before
>     we can checkpoint them.
> 
> 12. Oren: Checkpointing/Restoring mount namespaces
> 
> 	Bind mounts are restored in container.
> 
> 	NFS: at least on OpenVZ, since network is frozen, reopening files over
> 	NFS is not possible until restart is complete. OpenVZ creates fake
> 	dentries to allow the open to proceed.
> 
> 	Loopback devices - cannot open them in a container since they can
> 		lockup system with huge memory footprint ??
> 
> 	We should disable shared-mount propagation at least for now.
> 
> 13. Oren: cradvise()
> 
> 	Use a single system call to optimize the checkpoint/restart ?
> 	Eg: If an fd refers to /dev/tty1 in the checkpoint-image and that tty
> 	is not available on restart, user-space could open another tty and
> 	teach the kernel to use a different tty, /dev/tty2, during
> 	restart. Another example is if an application has several megs of
> 	"scratch" memory that does not need to be checkpointed, it could
> 	use the cradvise() system call to optimize the checkpoint or restart.
> 
> 	The conclusion was it would be hard to get acceptance from community,
> 	for a new variant of ioctl/fcntl call. So, we should instead try to
> 	add the necessary features to existing system calls like fcntl(),
> 	shmctl() or madvise().
> 
> 14. Oren: Unlinked files/directories
> 
> 	May need to copy the contents of the deleted file to the
> 	checkpoint image (only on ext4?). Create a fake hard link to the
> 	file so the file still exists in the filesystem snapshot and remove
> 	the link during restart.
> 
> 	There is a good paper discussing snapshot/restore of unlinked files
> 	on Xen. The same concept could be used in C/R too ?
> 
> 	(If you have links to the paper, please add)
> 
> 15. Network namespaces
> 
> 	Restore namespaces in user-space, restore sockets in-kernel.
> 
> 	Cannot create devices in user-space unless we know the index for
> 	the network device ?
> 
> 	(Missed details on this discussion)
> 
> 16. Time
> 
> 	Will need some policies on restart like:
> 		- use absolute time or relative time
> 		- do new children inherit the policy ?
> 		- do we gradually adjust from relative to absolute time ?
> 
> 	If not cradvise(), maybe timectl() :-p
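
The first policy choice in item 16 can be written down as a tiny model.
This is a sketch under assumptions: the policy names "absolute" and
"relative", the function name and its parameters are invented for the
example, and the gradual-adjustment variant is left out.

```python
def restored_clock(policy, checkpoint_time, restart_time, now):
    """Return the time a restarted application would observe.

    policy          -- "absolute" or "relative" (hypothetical names)
    checkpoint_time -- wall-clock time when the checkpoint was taken
    restart_time    -- wall-clock time when the restart completed
    now             -- current wall-clock time on the restart host
    """
    if policy == "absolute":
        # The application sees the real clock and notices the gap.
        return now
    if policy == "relative":
        # The application's clock resumes where it stopped at checkpoint.
        return checkpoint_time + (now - restart_time)
    raise ValueError("unknown time policy: " + repr(policy))


# Checkpoint at t=100, restart finished at t=500, queried at t=530.
print(restored_clock("absolute", 100.0, 500.0, 530.0))
print(restored_clock("relative", 100.0, 500.0, 530.0))
```
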
> 
> 17. VDSO
> 
> 	(Missed details on this discussion)
> 
> 18. Async I/O
> 
> 	Getting a lockdep report during checkpoint ?
> 	OpenVZ flushes I/O, waits for pending I/O and then retries checkpoint
> 	We may need to do the same for mmap I/O ?
> 
> 19. Checkpoint data structures:
> 
> 	- Try to keep extensions to existing data structures minimal
> 	- If necessary, add to end of data structures
> 	- But do not get locked down to an ABI at this point. i.e.  even after
> 	  entering mainline, format of checkpoint image may change for a while
> 	  before stabilizing.
> 
> 20. Test suite:
> 
> 	OpenVZ has some test cases that have various applications go to
> 	specific states and wait for a checkpoint. After the checkpoint and
> 	again after the restart, they check that nothing has changed
> 	unexpectedly.
